Google’s enterprise IT teams build and maintain the infrastructure that powers Google. Our mission is to enable Googlers to do their best work by providing them with the reliable, secure, and scalable infrastructure they need. Our vision is to be the world’s leading provider of enterprise technology solutions.
In order to best meet the needs of Googlers, we knew we needed to take advantage of the scale and operational standards of our own cloud infrastructure – and this transformation has been effective; it’s clear our migrations eliminated low value, low-level and toilsome work, up-leveled the skills of our team, and ultimately gave them more leveraged tools to provide value added features to our users.
Given that we’ve completed our lift-and-shift (that included over 100 vendor supplied services) and a subsequent cloud modernization effort, we want to share our journey to this point, the benefits we reaped, as well as the associated risks uncovered along the way.
Looking back – cloud enablement and lift and shift (2016 to 2019)
After a brief discovery and prototyping phase to allow our team to acclimate to Google Cloud, our move to cloud started in earnest with an in-depth assessment of existing corporate infrastructure and our corporate workloads. This phase was focused on a lift and shift as the other compute runtimes were still in their infancy at the time.
Our cataloging effort was relatively manual as many lift-and-shift tools such as Migrate for Compute Engine did not exist back then. Our main activities included data source discovery/cleaning/joining, the creation of homemade scanning utilities, and conducting interviews to sort through over 20 years of tribal knowledge – enterprise archeology at its best!
To make this cataloging work tenable at scale, we distributed the work across service and infrastructure teams, while maintaining a central source of truth.
For services, we asked service owners to curate user-journeys to identify and prioritize their key functionality and user groups. We similarly asked the infrastructure teams to describe their components in terms of technology agnostic features and performance specifications. We then mapped our prioritized service user journeys to the underlying infrastructure capabilities, allowing us to quickly identify services that could move easily and those for which there were migration blockers, since the corresponding functionality did not exist in Google Cloud.
Blockers were shared with our colleagues in the Google Cloud organization for further study. These conversations led us to either adapt and modernize our approach to delivering a particular scenario, or for them to add new features to the platform. Over the years, these conversations have led to over 110 new enterprise-ready features including the creation of Identity Aware Proxy, Cloud SQL, NFS volumes, IAM deny policies, hierarchical firewalls, and packet mirroring (to name a few).
From the services that had no blockers, we chose three of varying complexity to canary a migration, inviting Google Cloud developers, customer engineers, and our corporate engineering teams to get some hands-on experience in migrating real world applications.
We then paused to evaluate all the toilsome parts of these deployments and developed tooling to automate and standardize our cloud deployments. This included tools in the following large buckets:
Landing Zone Automation – using Terraform we created a set of customizable, repeatable and secure by default landing zones.Configuration managers – to maintain landing zone security posture and general configuration integrity over time, we created tools to detect and correct drift.Bridge Systems – till our migration was complete, we required bridges to corporate infrastructure (e.g. connections to our corporate DHCP, DNS, auth systems, machine inventory systems) as well as connections to the services that had not yet migrated.
For all unblocked services, we then scaled our migration first by providing white glove migration assistance to our early adopters; gaining momentum and steering our tooling and approaches towards self-service.
For blocked migrations, we notified service owners whenever a supporting feature was released for them to trial and, subsequently, attempt their migration- providing first-hand feedback to Google Cloud along the way.
Over two years we migrated dozens of applications to Google Cloud. The most complex of which was our migration from Oracle back-ends to “SAP on Google Cloud”, which inspired the now standard deployment strategy.
In closing, while we were worried early on about some cultural resistance to our move (due to concerns about role redundancy and reduced customizability of the cloud environment) – these worries turned out to be unfounded.
Present state: Cloud Modernization (2019 to 2024)
As our Google Compute Engine ecosystem matured and our cloud footprint grew, we started looking for avenues that supported our growing scale.
At this point, similarly to other large enterprises, Google’s cloud journey matured from enablement to modernization; from a lift and shift effort to one that’s focused on adding capabilities while increasing efficiency.
To embody this shift, we took a step back and worked to define the end-to-end requirements of an enterprise service.
We found that, generally speaking, enterprise workloads tend to share a common set of architectural building blocks: Front-end / UX, Connectivity, Compute and miscellaneous integrations.
While this wasn’t unexpected, it was interesting; for years, there’s been a prevailing internal misconception that Google is a snowflake – requiring solutions and features that no other enterprise does.
We now know that all enterprises are snowflakes – they all look the same from a distance, but zoom in enough and you’ll see that every one of them is unique.
With that in mind, we did have to be cognizant that some requirements, mostly those stemming from Google being both a cloud consumer and its own provider, are indeed unique.
Our new cloud platform of choice, Google Kubernetes Engine (GKE), allowed for rapid scalability, repeatability and manageability, thus creating a modern cloud ecosystem for Google’s enterprise IT.
The next step was to map the above building blocks to Google Cloud services, creating increased dependencies on new architectural elements like Filestore, Cloud SQL and more.
Having said that, this process introduced new challenges:
Multi-environment enterprise: GKE is clearly the right solution for some workloads (e.g., most off the shelf software, bare-metal VMs, etc) while GCE is still a superior solution for others (e.g., workloads without a docker image). In addition, our traditional data center footprint is still necessary and being used for other use cases.
This means we’re running a multi-environment enterprise with ever-changing volumes and demand pools.New gaps: as we shifted our focus towards GKE, new gaps were discovered. Some of these were already resolved for the mature GCE platform and had to be refitted to the nascent one.Early adopters: in many cases, we were customer zero – the first enterprise to use a new feature set or even product. That meant growing pains and suboptimal performance but also the ability to help shape the final products that make Google Cloud a great enterprise platform.
Looking forward: scale, governance and optimization
What does the finish line of a cloud migration look like? Is it enough to lift-and-shift and replatform existing applications? We propose that even though it might be satisfying to move your last on-premise application to cloud, that is just the beginning, and that cloud technologies enable and accelerate the IT infrastructure continuous improvement loop. In our experience, rebaselining to Cloud has enabled us to continue to innovate across the following dimensions:
Configuration managementGovernance, change management and policySecurityOperationsChange Management
While this transformation modernized our ecosystem and ensured we were able to avail of the economies of scale benefits of Google’s Cloud ecosystem, it also introduced a new set of enterprise challenges, risks and mitigation strategies:
Enterprise governance:
Introducing new tech is only the first step; as an enterprise you have to ensure your users are using it in a safe and efficient manner. Achieving this goal is complex because it requires both technology driven guardrails and cultural shifts (at least in agile organizations).
This also surfaces a cultural issue: how do you maintain agency within your organization while being highly opinionated about your users’ use of technology?
One thing we learned is that mandates don’t work – it turns out that people don’t like to be told what to do without understanding why they should do it. This means that any kind of opinionated guardrail should present a true benefit to the user. We’ve done this in the form of security pre-approved usage patterns – these allow users to follow pre-defined architectural blueprints as they’re working to stand up their workload without having to go through a lengthy security review process. By creating a symbiotic well lit path, everyone wins; the user gets a lighter lift en-route to a new workload while the organization gets reduced toil and risk.
Multi footprint optimization issue
Unblocking new tech also elevates the risk of generating a “success disaster”; new tech, and especially new platforms, means increased cost and toil. This quickly becomes an optimization problem. If you want to fully manage a platform, how many different elements can it contain before the managed becomes unmanageable? Tech convergence becomes a key element in this continuous tech battle and while converging on a minimal set of platforms (along with the aforementioned tech governance) is costly in the short-to-medium term, it results in a lean tech stack with minimum toil in the long run.
It has been an exciting journey thus far and we look forward to continuing to revolutionize our enterprise IT using Google Cloud’s powerful capabilities. If you’d like to read more about how our Google’s enterprise IT teams run Information Technology (IT) at Google, and learn how we scale and adapt to the future of work, take a look at our Corporate Engineering Insights page.