If you read only one of my series of posts in this lifetime, you might want to make it this one. It is certainly not my most coherent or interesting series but it is the one that is likely to save you the most grief over the next few years (assuming you have something to do with SharePoint deployments).
This was going to be a single post but I got a bit carried away so I've split it up into the following posts:
- This one - An Overview of the Issues
- The next one - An Overview of the Potential Solutions.
- The one after that - EBS vs. RBS...the ultimate grudge match (because I begrudge having to deal with it.)
- Maybe a bonus final posting but probably not.
An Overview of the Issues
I have dedicated a lot of posts in the past to talking about how and why it makes sense to create a nice symbiotic relationship between your ECM system of choice and SharePoint. I won't ramble on about silos, compliance, long term archiving, over duplication of content, scalability, etc. ad nauseam (again); instead I will take a different tack.
Imagine a world where content created in SharePoint was automatically routed to the most appropriate location depending on factors such as values in the object's attributes, where the object is in its lifecycle and/or who created it. Imagine that this was done without in any way affecting the SharePoint end user experience or any applications built on top of SharePoint. Imagine if doing this didn't just reduce risk and costs but it also made your SharePoint deployments more scalable and robust.
Before we continue to dream of such a thing let's talk about one of the fundamental issues with the SharePoint architecture. SharePoint stores everything in SQL Server: not just the structured content but also the unstructured content (the actual binaries - Word documents, PDFs, JPGs, etc.). If we could get these binaries out of SQL Server and manage them in a more appropriate way then many of the limitations and concerns around SharePoint would be lessened.
Let's start by reviewing why getting content out of SharePoint’s back end is a big deal. You might be asking what's wrong with how SharePoint manages content today; to save you going back over my blog to read the answers let me give you a high-level overview.
How SharePoint Manages Content Today
- If you import a 10GB object into SharePoint and fill out the properties screen, the 10GB object and its associated metadata cascade down the SharePoint stack and both end up in SQL Server.
- The attributes get written to a database table and the 10GB file is also stored in the database as a Binary Large OBject (BLOB).
- In my ever so humble opinion the only thing good about BLOBs is the cool acronym. When I was a DB admin we were told never to store binaries in the database unless there were lots of tiny winy wittle files - XML for example.
- In fact, many of the limitations that you hear about SharePoint's scalability actually come from SQL Server trying to store the BLOBs, not specifically from SharePoint or IIS.
In summary, storing the BLOBs in SQL Server creates issues related to scalability and to the creation of silos.
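To make the difference concrete, here is a minimal sketch of the two storage patterns using Python's built-in sqlite3 module as a stand-in for SQL Server. The table names, the `store_inline` and `store_external` helpers and the use of a temp directory as the "external store" are all my own assumptions for illustration; this is not SharePoint's actual schema or API.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path


def store_inline(conn, name, payload):
    """Pattern 1: metadata and binary both live in the database row."""
    conn.execute(
        "INSERT INTO docs_inline (name, content) VALUES (?, ?)",
        (name, payload),
    )


def store_external(conn, blob_dir, name, payload):
    """Pattern 2: metadata in the database, binary outside it, linked by an ID."""
    blob_id = uuid.uuid4().hex  # hypothetical pointer format
    (blob_dir / blob_id).write_bytes(payload)
    conn.execute(
        "INSERT INTO docs_external (name, blob_id, size) VALUES (?, ?, ?)",
        (name, blob_id, len(payload)),
    )
    return blob_id


def demo():
    """Store the same 1 MB payload both ways and report where the bytes ended up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs_inline (name TEXT, content BLOB)")
    conn.execute("CREATE TABLE docs_external (name TEXT, blob_id TEXT, size INTEGER)")
    payload = b"x" * 1_000_000

    with tempfile.TemporaryDirectory() as tmp:
        blob_dir = Path(tmp)
        store_inline(conn, "report.docx", payload)
        blob_id = store_external(conn, blob_dir, "report.docx", payload)

        # Bytes of payload the database itself has to carry in each pattern.
        inline_bytes = conn.execute(
            "SELECT SUM(LENGTH(content)) FROM docs_inline"
        ).fetchone()[0]
        pointer_chars = conn.execute(
            "SELECT SUM(LENGTH(blob_id)) FROM docs_external"
        ).fetchone()[0]
        on_disk = (blob_dir / blob_id).stat().st_size

    return inline_bytes, pointer_chars, on_disk


if __name__ == "__main__":
    print(demo())
```

In the second pattern the database only carries a short pointer per document, which is the whole point of getting the payload out of SQL Server.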
How bad is using BLOBs really?
That really depends on who you are and what SharePoint is being used for. For small, non-regulated deployments you may not care; however, if you want to use SharePoint as part of your enterprise infrastructure then it potentially really sucks. For example, there's a good chance that you will be forced to structure your site and farm topology based on SQL Server capacity rather than on the business need. Technology driving deployments is never a good idea...in fact it always bites you in the end (the rear end usually).
What would avoiding Database BLOBs give me really?
Oh, I am so glad I asked (I am a Gemini so this split personality thing is perfectly normal according to my team of psychiatrists). I use the term “Data Aggregation” in my architectural postings to describe the concept of storing all of the binary objects in a single location. Here's the picture to hold in your mind: binary objects that are being managed by SharePoint would be stored in a single centralized system; not all of them all of the time but whenever it makes sense. Along with the binary object we also take a "convenience copy" of some of the object's metadata and any other contextual information of interest (the folder it came from, for example). This metadata and context data is captured simply so that we can work intelligently with the object.
So what would this aggregated view give me? Let's break it down into three key areas:
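As a rough illustration of what "working intelligently with the object" could look like, here is a small Python sketch that routes an object to a storage target based on its convenience copy of metadata. Everything in it is hypothetical: the `DocMeta` fields, the rules, the `LEGAL_TEAM` set and the target names are invented for the example and do not correspond to any SharePoint or ECM API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DocMeta:
    """Hypothetical 'convenience copy' of an object's metadata and context."""
    name: str
    content_type: str
    lifecycle: str        # e.g. "draft", "published", "archived"
    created_by: str
    source_folder: str


@dataclass(frozen=True)
class Rule:
    description: str
    matches: Callable[[DocMeta], bool]
    target: str


LEGAL_TEAM = {"amy", "raj"}  # assumed list of users, purely illustrative

# Rules are evaluated in order; the first match wins.
RULES = [
    Rule("archived content goes to the archive device",
         lambda m: m.lifecycle == "archived", "cas-archive"),
    Rule("contracts go to the ECM system",
         lambda m: m.content_type == "contract", "ecm-repository"),
    Rule("anything created by legal goes to the secure file system",
         lambda m: m.created_by in LEGAL_TEAM, "secure-fs"),
]

DEFAULT_TARGET = "sql-server"  # leave the binary where SharePoint put it


def route(meta: DocMeta, rules=RULES) -> str:
    """Return the storage target for an object based on its metadata."""
    for rule in rules:
        if rule.matches(meta):
            return rule.target
    return DEFAULT_TARGET


if __name__ == "__main__":
    doc = DocMeta("deal.docx", "contract", "published", "bob", "/Sales/Deals")
    print(route(doc))
```

The ordering of the rules matters here: an archived contract would match the first rule and never reach the second. A real rules engine would need a clearer story on precedence, but the principle is the same.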
- Operational Efficiencies
- Use the metadata values to intelligently perform hierarchical storage management (HSM) on the objects – i.e. move content to different storage devices based on a set of rules. This reduces hardware costs considerably.
- While you are at it why not de-duplicate the content, (think of the storage and backup savings there). This de-duplication is not just within a single site, you could de-dupe across all SharePoint sites in the entire company. Bear in mind that if you have versioning switched on in SharePoint it tends to create a huge number of identical copies of documents…more than you might expect.
- Don't stop there; you can also more efficiently deploy your SQL Servers because they will scale up the wazoo if they only contain the structured data. ("up the wazoo" is a DBA technical term.)
- Backup and recovery…well the jury is out on this one. Depending on where you aggregate your content to and how you do your backups (hot vs. offline) this could either be hugely efficient or a bit of a nightmare. We are working on that one - if you have any ideas contact me directly for a chat.
- You will certainly increase the scalability of your deployments by removing the payload from SQL Server; this alone might justify the effort to aggregate your data.
- Many companies do not manage all of their SharePoint deployments from within the data center simply because they do not have the capacity to service that many separate systems. This aggregation approach means that IT could allow departmental rollouts of SharePoint but they would mandate that the deployments utilize the aggregated store so IT has responsibility for the actual content.
- Governance, Risk and Compliance (GRC)
- Once you have all, or a subset, of your content in a central repository you can start to apply good governance controls to it. If it is an ECM system you can start applying retention controls, you can apply digital rights management to objects and have more robust data protection/audit controls.
- Leveraging the HSM feature you can apply different levels of control to different categories of content. For example, some content might just be stored on a secure file system, other content might be dumped into your ECM system and some might go to a CAS device.
- Probably the biggest compliance advantage is the ability to centralize the application and management of your controls. Given that all of your corporate assets are now in one place you can apply a common set of controls to them – one file plan, one set of retention schedules and one disposition process.
- Business Gains
- What about your end users? To be fair, they’d probably love this system simply because it is non-invasive but we can throw them another bone or two just to be kind.
- eDiscovery is a case in point – obviously having everything in one place, de-duplicated and indexed is a boon for the legal-eagles but there is another bonus feature…as I mentioned in the first list, right now there is a plethora of content lurking in the hidden department-deployed SharePoint systems; aggregation means that you can ensure that you have visibility into all of this content.
- Long term archiving is another interesting use-case; this could be compliance driven or just good business practice. If your content resides in SharePoint then in order to have access to the content in 20 years' time you will have to still have SharePoint running. This is neither attractive nor realistic! However, if the key business documents have been archived out with their metadata then you have many more options such as leaving them in your existing long term repository, exporting them to a file system with an XML tag file, storing them on a CAS device, etc.
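The de-duplication point from the Operational Efficiencies list is easy to sketch. Below is a toy content-addressed store in Python: each binary is keyed by its SHA-256 digest, so identical copies (across sites, or across versions of the same document) are stored once. The `DedupStore` class and its method names are my own invention for illustration, and an in-memory dict stands in for whatever the real aggregated store would be.

```python
import hashlib


class DedupStore:
    """Toy aggregated store that keeps one physical copy per unique binary."""

    def __init__(self):
        self._blobs = {}  # digest -> bytes (one physical copy)
        self._refs = {}   # (site, path, version) -> digest

    def put(self, site, path, version, payload):
        """Record a document version; store the bytes only if they are new."""
        digest = hashlib.sha256(payload).hexdigest()
        self._blobs.setdefault(digest, payload)
        self._refs[(site, path, version)] = digest
        return digest

    def get(self, site, path, version):
        """Fetch the bytes for a specific site/path/version."""
        return self._blobs[self._refs[(site, path, version)]]

    def logical_bytes(self):
        """Size as the SharePoint sites see it: every reference counted in full."""
        return sum(len(self._blobs[d]) for d in self._refs.values())

    def physical_bytes(self):
        """Size actually stored after de-duplication."""
        return sum(len(b) for b in self._blobs.values())


if __name__ == "__main__":
    store = DedupStore()
    doc = b"quarterly report " * 1000
    store.put("hr-site", "/docs/report.docx", 1, doc)
    store.put("hr-site", "/docs/report.docx", 2, doc)       # identical new version
    store.put("finance-site", "/shared/report.docx", 1, doc)  # same file, other site
    print(store.logical_bytes(), store.physical_bytes())
```

Three references, one physical copy. This only catches byte-for-byte identical files, which is exactly the versioning and copy-to-another-site case described above; it does nothing for near-duplicates.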
So, hopefully you now have a feel for why you might want to take a look at aggregating your content into one location. In the next post I’ll discuss the different approaches that are available and then in the third post I’ll focus on two new capabilities that Microsoft have made available – EBS and RBS.