Read more about the eight reference architectures.
Let me preface this entry by saying that this reference architecture is similar to reference architecture #1 inasmuch as it is not something that I recommend or endorse, but it is something that I see in use fairly frequently today; in fact, many of the solutions recommended by Microsoft fall into this category. At best, feel free to use this approach while you wait for more appropriate solutions to come along.
In the previous architectures we were trying to create a unified model so that an end user could simultaneously see content that resides in the SharePoint systems and in the traditional ECM systems. In this architecture (and the next) we are going to deal directly with the SharePoint "silo sprawl" in the data center. I call this approach 'aggregation', not unification. The advantage of architectures 6 & 7 is that they are completely transparent to the end user - that is also their failing; they are so transparent that you cannot use them to expose the advanced capabilities of the underlying ECM solutions to end users. FYI - I am working on a cocktail of solutions that gives you the best of both worlds, but I'll probably keep that under wraps until I have ironed it out a lot more.
The general premise here is to leave each instance of SharePoint running independently as an individual, disparate system but to create a virtual aggregation of the content, and potentially the metadata, behind the scenes. This approach does not actually change any part of SharePoint; the reference architecture's processes and technologies simply monitor the SharePoint implementations and create an aggregated view of the content stored across all libraries. This aggregated view can then be used to make security decisions, perform risk analysis, monitor file usage, etc.
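To make that concrete, here is a minimal sketch of what a single entry in such an aggregated view might hold. The field names, the ACL summary string and the use of SHA-256 are my own assumptions for illustration, not part of any particular product:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class IndexEntry:
    """One row in the virtual aggregation index - no content is moved,
    only descriptive data about where it lives and who can see it."""
    site_url: str       # which SharePoint instance / site
    library: str        # document library name
    item_path: str      # path of the item within the library
    content_hash: str   # SHA-256 of the file bytes
    acl_summary: str    # e.g. "Finance-Restricted" or "Everyone" (simplified)
    last_modified: str  # ISO-8601 timestamp from the source system

def hash_content(data: bytes) -> str:
    """Hash the raw bytes so identical files can be matched across silos."""
    return hashlib.sha256(data).hexdigest()
```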
For example, one could create a master index of all content in all SharePoint instances inside the organization. This master index would contain the basic information about the contents of all of the document libraries, along with the security access settings and a hash value for each piece of content. You could then analyze this information and look for instances where a file’s security is not consistent. Let's consider a use case: assume that the company’s year-end financial results are stored in the finance department’s SharePoint document library with very restrictive security in the weeks before the official results are announced. If the passive unification process looked in the master repository index and noticed that this same file was also available on one of the finance department's collaborative SharePoint sites with no security, then you might want to do something about that! Note that it would know that the two files were the same even if the file name had been changed, because the hash value uniquely identifies the content.
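As a sketch of that analysis step, assuming the hypothetical IndexEntry records above (and treating the ACL summary string as a gross simplification of real SharePoint permissions), flagging a file that appears in more than one place with different security could look like this:

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def find_inconsistent_security(entries: Iterable[IndexEntry]) -> Dict[str, List[IndexEntry]]:
    """Group index entries by content hash and flag any hash whose copies
    do not all share the same security setting."""
    by_hash: Dict[str, List[IndexEntry]] = defaultdict(list)
    for entry in entries:
        by_hash[entry.content_hash].append(entry)

    flagged = {}
    for content_hash, copies in by_hash.items():
        acls = {copy.acl_summary for copy in copies}
        if len(copies) > 1 and len(acls) > 1:
            # Same bytes, different protection - the year-end-results scenario.
            flagged[content_hash] = copies
    return flagged
```

Running something like this over the master index would surface the year-end results file sitting unprotected on the collaborative site, regardless of what it had been renamed to.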
There are plenty of other use cases for this data once you have access to it - monitoring data duplication, managing disposition, etc. But is this really an aggregated back end? Not really, because nothing is actually aggregated; it can still be useful, though, because it creates a virtual unified view of the data and, once again, it is a very non-invasive, low-impact approach.
One issue to consider with this approach is scalability. If you have a billion data objects in your SharePoint systems, then your enterprise-wide content index would have a billion entries in it. The other consideration is how to keep this index synchronized as content is added, altered and, especially, removed from the systems. Those operations are very CPU-heavy.
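A minimal sketch of that synchronization problem, assuming the monitoring layer can deliver a stream of change events (the event shape here is invented for illustration; real SharePoint change logs look different):

```python
from typing import Dict, Tuple

# The index maps (site_url, item_path) -> IndexEntry.
Index = Dict[Tuple[str, str], "IndexEntry"]

def apply_change(index: Index, event: dict) -> None:
    """Apply one change event to the in-memory index.

    'event' is an assumed shape: {"type": "add" | "update" | "delete",
                                  "key": (site_url, item_path),
                                  "entry": IndexEntry or None}
    """
    if event["type"] in ("add", "update"):
        index[event["key"]] = event["entry"]
    elif event["type"] == "delete":
        # Deletes are the expensive case at scale: every downstream report
        # or cached analysis built on this entry also has to be invalidated.
        index.pop(event["key"], None)
```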
Talk to the market leaders in database, search and indexing technology for advice on these issues. A billion objects synchronized across an enterprise is theoretically within the abilities of most enterprise database systems; however, I have talked to customers who generate over 1.5 billion records a day (yes, that's the right number). These are transactional records currently ingested into specialized storage, but they could easily be stored as XML output files. Assuming they generate 1.5 billion records each working day, they will end up with roughly 390 billion objects a year! Scalability becomes an issue very quickly with this kind of ingestion.
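The back-of-the-envelope arithmetic behind that 390 billion figure, plus a rough index-size estimate (the 260 working days and the ~1 KB of metadata per entry are my own assumptions):

```python
records_per_day = 1_500_000_000     # 1.5 billion records per working day
working_days_per_year = 260         # ~52 weeks x 5 days

records_per_year = records_per_day * working_days_per_year
print(f"{records_per_year:,}")      # 390,000,000,000 -> roughly 390 billion

# Assumed: ~1 KB of index metadata per object (path, hash, ACL, timestamps).
index_bytes = records_per_year * 1_024
print(f"~{index_bytes / 1024**4:.0f} TiB of index metadata per year")
```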
Again, I see this as an architecture that is included for completeness and because it contains some concepts that are useful elsewhere (think eDiscovery, for example). Also, it can be "done on the cheap"; you can achieve passive virtual aggregation using the tools that you own today. A more useful reference architecture is to create some form of real unification at the back end...see the next reference architecture for more on that thrilling concept.