One of the most commonly asked questions these days is “Should all our source code be in one repository?” This is a complex question and leads to a somewhat interesting set of answers.
Before we get to that, let's try to understand the question a little better and find out why customers are asking it. In IT we like to centralize and optimize. Source code is seen as the next logical set of distributed data ripe for centralization and optimization. Having everything in one place means we can manage access better, manage backup and recovery better, and ensure everyone is able to maximize the reuse of code.
However, this flies in the face of modern developer behavior. At large and small IT organizations we see developers downloading open source source-code management systems for themselves and their teams. Instead of having one repository in one place, we are seeing repositories on every server and developer hard drive, creating a vast digital archipelago of repositories where processes and standards evolve on a team-by-team basis, mimicking the finches Darwin recorded on the Galapagos.
And this is the dilemma. Corporate responsibility drives towards a single repository strategy, but developers want local control and ownership of their code.
What does corporate want?
So what does corporate really want when they say they want a single repository? Typically they are trying to address several concerns at once:
- Visibility into all the artifacts in the repository
- Central access control over the artifacts
- Conformance to governance guidelines and audit reporting requirements
- Segmentation of the artifacts to match separation of duties mandates
- Support for shared code and refactoring initiatives
- Enterprise wide impact analysis
- Control over misuse, misappropriation and malicious activities
- Consistent backup of the repository
None of these are architectural in nature: they are all functional requirements that are easy to satisfy with a single repository and very difficult, impossible in some cases, to achieve with team-based repositories.
What do developers want?
Developers want the least amount of technology and process so that they can develop at speed: to, as Mark Zuckerberg described it, “move fast and break things.” This means:
- Solutions they can obtain without budgetary permission
- A repository that is easy to use and flexible to their needs
- Low process, governance and control
- Easy (or no) administration
- Simple (or no) licensing
- Fast checkouts and check-ins (especially GetLatestVersion) across the LAN and WAN
Once again, these requirements are not architectural. They too are just a list of requirements. While they seem to conflict with what corporate governance demands, there is common ground, and a proper technical solution that meets both sets of requirements is possible.
Developers fear having their code hosted on a platform that they are not developing for. Mainframe developers would never countenance their COBOL code being hosted on Windows, and no Unix developer would accept their code being hosted there either. Developers in Beijing find it hard to accept their code being hosted in Bulgaria and managed from Boston. Add to this the numerous code pages and, perhaps, the ASCII to EBCDIC conversion issues that would ensue.
Most developers these days use code analysis tools designed for the development platform they are using. This means keeping the code on that platform, which in turn means duplicating the code from the single repository back to the distributed platforms.
As I said at the beginning this question raises many interesting issues. None is more pressing than this though.
What neither of these positions, single repository versus multiple distributed repositories, takes into account is that the source code repository represents the collected intellectual property of the corporation. It is a business’s most valuable asset, worth far more than the goods and services it provides, and this is why it has become a prime target of hostile foreign governments, unscrupulous competitors, disgruntled employees and organized crime.
Secure SDLC: the next standard in repositories.
In tomorrow’s repository the design needs to represent best practices in secure data management. Protection of the repository is of utmost importance. This means that our repository must have:
- Single point of access control
- Robust auditing
- Encryption of artifacts
- Tamper detection of artifacts, logs, audit trails, reports and the software itself
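One common way to make logs and audit trails tamper-evident is to chain each entry to the hash of the entry before it. The following is a minimal sketch of that idea, not a description of any particular product; the function names, the record fields and the sample users are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "hash": _digest(prev_hash, record)})


def verify(log: list) -> bool:
    """Recompute the chain; an edited, reordered or mid-chain deleted entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical audit records for illustration only.
audit_log = []
append_entry(audit_log, {"user": "alice", "action": "checkout", "artifact": "payroll.cbl"})
append_entry(audit_log, {"user": "bob", "action": "checkin", "artifact": "payroll.cbl"})
```

Calling `verify(audit_log)` returns `True` until someone alters a stored record. Note that this sketch does not detect removal of the most recent entries; for that, the latest hash would have to be kept somewhere the attacker cannot reach.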
The ideal repository architecture
The ideal repository architecture is neither a single repository nor multiple repositories.
Here are the key ingredients and, as you will see, they satisfy all the corporate and all the developer needs:
- Secure repository defended against exfiltration and infiltration of code
- Process-centric, allowing enforcement of one (or many) development processes irrespective of platform and in support of all development methodologies
- Secure, immutable logging and audit trails
- Single point of user and tool administration
- Artifacts stored on the platform of choice by developers
- Artifacts backed up by native utilities optimized for that platform
- High speed performance over LAN and WAN
- High speed performance irrespective of the user load, the size of the repository and the volume of versions and changes being managed and tracked
- Caching of a minimal amount of code as needed, reducing duplication and limiting misuse
We call this a Single Virtual Repository.
From a management and administration point of view it appears as a single repository but, behind the scenes, the SCCM software manages all the artifacts in their respective locations on their respective platforms.
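To make the idea concrete, here is a minimal sketch of a central catalog that records where each artifact lives and applies one access check before revealing that location. It is an illustration under stated assumptions only: the `VirtualRepository` class, its methods, and the sample platforms and paths are hypothetical, and a real product would do far more.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualRepository:
    """One catalog and one access list in front of many storage locations."""
    # artifact name -> (platform, location); the content stays on that platform
    catalog: dict = field(default_factory=dict)
    # user -> set of artifact names the user may read
    permissions: dict = field(default_factory=dict)
    # every lookup attempt, granted or denied, is recorded here
    audit: list = field(default_factory=list)

    def register(self, artifact: str, platform: str, location: str) -> None:
        self.catalog[artifact] = (platform, location)

    def grant(self, user: str, artifact: str) -> None:
        self.permissions.setdefault(user, set()).add(artifact)

    def locate(self, user: str, artifact: str) -> tuple:
        """Return where the artifact lives, after a central access check."""
        allowed = artifact in self.permissions.get(user, set())
        self.audit.append((user, artifact, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} may not read {artifact}")
        return self.catalog[artifact]


# Hypothetical artifacts on two different platforms.
repo = VirtualRepository()
repo.register("payroll.cbl", "z/OS", "PROD.PAYROLL.SRC")
repo.register("billing.c", "unix", "/srv/src/billing")
repo.grant("alice", "payroll.cbl")
```

Administrators see one catalog, one permission table and one audit trail, while `locate` hands back a platform-specific location so the artifact itself never has to move.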
From a developer’s point of view, their code is co-located with the team, allowing for the fastest possible access. It also means that code analysis tools are able to execute on the code natively without duplicating it. Each team can have its own process or a mandated one, as processes and access rules can be defined at a project or even an artifact level.
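Rules defined at a project level with overrides at an artifact level could be resolved along these lines. This is a sketch only; the rule tables, process names and users are hypothetical and chosen purely for illustration.

```python
# Hypothetical rule tables: an artifact-level rule overrides its project's rule.
project_rules = {
    "payroll": {"process": "change-approval", "writers": {"alice", "bob"}},
    "website": {"process": "lightweight", "writers": {"carol"}},
}
artifact_rules = {
    ("payroll", "tax_tables.cbl"): {"writers": {"alice"}},
}


def effective_rule(project: str, artifact: str) -> dict:
    """Start from the project rule, then apply any artifact-level override."""
    rule = dict(project_rules[project])
    rule.update(artifact_rules.get((project, artifact), {}))
    return rule


def may_write(user: str, project: str, artifact: str) -> bool:
    """Check write access against the effective rule for this artifact."""
    return user in effective_rule(project, artifact)["writers"]
```

Here the payroll project keeps its mandated process for every artifact, while write access to one sensitive file is narrowed to a single user without affecting the rest of the project.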
Central control but distributed data.