Smart Snippets turn web pages into databases

How do you describe a business? What about a person, or an intellectual work? There’s an interesting little secret that people in IT likely know, but that doesn’t always get to the C-Suite. Programming, at its core, is all about creating models. Sometimes those models are of classes of things, sometimes they better describe processes, but it is rare for a piece of software in your organization not to touch on at least some of perhaps a few dozen critical types of things.

In large enterprises, it’s not at all uncommon for the organization to go through a form of fire drill known as “creating the enterprise data model” (in TLA-speak, “EDM”). This particular ritual is initiated by business analysts who talk in hushed tones about data dictionaries, cardinality rules, associations and constraints. There are almost always diagrams, typically entity relationship diagrams with lots of boxes and ovals and arrows, all neatly tied up in hushed debates about whether UML 1 or UML 2 rules apply and whether JSON or XML schema is the better denormalized form for handling streaming. Blood has been known to be drawn in these encounters. The end result, almost invariably, is a big, complex document called a schema, which is then placed in a folder on SharePoint while the programmers merrily ignore everything in it, until they get upset that their applications don’t work with the ones across the hallway and realize that they need to figure out interoperability.

The effort of creating such schemas can often be time-consuming, and the process opens up the potential for different groups within an organization to seek solutions that are optimized for their own requirements, even if they are inconvenient for others. For people who work with such data dictionaries – business analysts, taxonomists and ontologists – this struggle is an inevitable part of defining an organization’s data language, but that doesn’t mean it is an enjoyable part.

Yet one thing emerges for those who deal with the language of data – when you get right down to it, there is actually a pretty minimal subset of things that matter to a business or organization, and these can be modeled in the same way. While there are always variations and additions, most organizations have common structures – divisions, employees, facilities, customers and so forth. While address forms may vary somewhat, most of them tend to be uniform. Even paperwork – invoices, purchase orders, calendar events, etc. – has a lot of commonality.

Various organizations have built schemas around these, but because this information is so frequently accessed, Microsoft and Google (along with Russian search engine Yandex) came together in 2011 to establish a website called Schema.org. Initially, its purpose was to provide a home for various public schemas and microtagging languages, but starting in 2014, its focus shifted to consolidating these languages and creating a set of, well, schemas, that organizations could use to describe their own business languages. This was not the first such effort – in the mid-1990s, a set of “tags” was created under the auspices of NCSA called Dublin Core (after Dublin, Ohio, where much of this work was done). With the rise of both XML and the semantic Web, Dublin Core, which focused primarily on publishing information, was refactored as the Dublin Core Information Model (or DCIM). Not surprisingly, Schema.org proceeded to slurp most of the DCIM terms into its own specification as well.


Today, there are nearly six hundred distinct types, in areas as diverse as

These can get into surprising detail – the Organization set itself includes sixty-one distinct properties (some strings of text or numbers, some other object types), and organizations in…