Developed by: Kurt Garloff, Friederike Zelke, Max Wolfs, Artem Goncharov (all SCS @ OSB Alliance), Nico Lück (GIZ), Ramkumar Permachanahalli, Walid Mathlouthi, Sreepathy H Vardarajan, Ayush Shukla, Yolanda Martinez, Hani Eskandar (ITU) in cooperation with GIZ, ITU, DIAL, and the Government of Estonia
The terminology used within this specification.
Virtualization
The process of creating an abstraction layer over computer hardware (storage, network, compute) that allows a computer to share its hardware with multiple virtual, separated environments.
Virtual Machines (VM)
Virtual hardware with virtual CPUs, memory (RAM), disks, and network adapters, on which consumers can run an operating system and software of their choice.
Hypervisor
Software that creates and runs virtual machines by abstracting the hardware and allowing multiple operating systems to run concurrently on a host computer.
Containerization
A form of lightweight virtualization that involves encapsulating an application and its dependencies into a container that can run on any computing environment.
Container
A process or set of processes with their own private, isolated view of the file system, network, and compute capacity. All containers on a (virtual) machine share the same operating system.
Multi-tenancy
An architecture in which a single instance of software runs on a server and serves multiple tenants (users or organizations), ensuring secure isolation between them.
Federation
The integration of multiple systems or organizations, allowing them to share resources and manage user identities across different domains while maintaining autonomy.
IAM (Identity and Access Management)
A framework of policies and technologies for ensuring that the right individuals have access to the right resources at the right times for the right reasons.
Region
A geographic area where cloud services and resources are deployed, typically consisting of multiple well-interconnected data centers to provide redundancy and ensure low-latency performance.
Availability Zone
A distinct location within a region that is engineered to be isolated from failures in other zones, providing high availability and fault tolerance.
Quota
Usage limits per kind of resource. Users can only create a limited amount of resources to avoid overly high bills or overly high consumption of limited resources.
UUID
Unique identifier. Typically a random number attached to a newly created resource and then used to uniquely identify and reference it. Commonly, 128-bit numbers in the format a78622a8-1177-47af-b5da-3378ee5d4313 are used; other lengths and formats are possible.
Infrastructure as Code
Virtual infrastructure (software-defined storage, network, and compute) is managed like code.
This section lists the technical capabilities of this Building Block.
Asset management for the physical infrastructure and automation to deploy the basic infrastructure tooling for rolling out the virtualization management platform and operational tooling (REQUIRED).
Virtualization management platform that supports Virtualization of compute, storage, and network resources on top of commodity server hardware (REQUIRED).
Compute virtualization offers the creation of virtual machines of various standardized sizes/properties (with respect to virtual CPUs and memory) (REQUIRED).
The available sizes/properties may be limited to a list of predefined templates (flavors) or be chosen freely (OPTIONAL).
Virtualized or pass-through GPU compute capacity may be exposed to the VMs (OPTIONAL).
There is a scheduling mechanism that finds the most suitable host that fulfills all the constraints from the user (REQUIRED).
Situations where common VM flavors can no longer be scheduled due to capacity shortage should be avoided (RECOMMENDED).
Users must have ways to prevent or force scheduling of VMs on the same host (REQUIRED).
Users may express softer preferences for the placement (RECOMMENDED).
The virtual machines are booted using operating system images. Providers provide a set of regularly updated standard images (REQUIRED).
The origin (or the custom build process) must be documented, and metadata must be available that allows users to see what patch status the image has and what updates and lifetime to expect (REQUIRED).
The provided standard OS images must support the processing of injected user-data for customization using mechanisms like cloud-init or the established alternatives (REQUIRED), see chapter 7.4.
The virtualization platform must allow users to upload custom images (REQUIRED). This may be subject to quota limits or billing to compensate for the associated storage cost.
There must be a way for users to reliably inject per-VM instance data that is being consumed by the VM on (first) boot to customize the instance’s configuration and role (REQUIRED).
Securely isolated private networks can be created (SDN = Software Defined Networking) and VMs attached to them (REQUIRED).
It must be possible to connect networks via routers (REQUIRED).
User-defined rule sets determine which communication is allowed between virtual networks and VMs (REQUIRED).
The virtual networks can be connected to external networks, providing outbound and inbound connectivity. It must be possible to do this by attaching additional IP addresses from the external network to existing internal network ports (REQUIRED).
In public clouds, it must be possible to connect to the internet this way (CONDITIONALLY REQUIRED) – in private clouds, connectivity to the internet may be heavily restricted or non-existent. In any case, the user-defined rule sets for connections also apply to external networks connected this way.
The network provides user-controlled load-balancers which redirect traffic to a set of backend services depending on their load and availability (REQUIRED).
The storage subsystem exposes block storage, that can be attached as virtual hard disks to the VMs (REQUIRED).
The storage subsystem supports user controlled snapshots and backups (REQUIRED).
The storage also exposes a standardized object storage interface (REQUIRED).
The storage offers several storage classes with different performance and encryption attributes (RECOMMENDED) whose properties are documented and discoverable.
Dedicated storage solutions may optionally be supported (OPTIONAL).
There is a key management function that can be used to securely handle secrets for storage and network encryption (RECOMMENDED).
The platform provides a Domain Name Service (DNS) capability (RECOMMENDED).
The provider must ensure security updates for the cloud management software are deployed, notifying users when this leads to workload or control plane disruption (REQUIRED).
Providers should minimize disruption e.g. by using live migration of VMs for host maintenance (RECOMMENDED).
All of the virtualized resources are controlled by REST APIs that allow on-demand self-service for authenticated platform users (REQUIRED).
The virtualization platform offers a service catalog that allows users to discover the available services, how to access them, and which optional features the services support (REQUIRED).
The REST APIs must be documented and accessible over public internet (REQUIRED).
The REST APIs must be supported by common Infrastructure as Code Tools (REQUIRED).
Ideally they are covered by an OpenAPI 3.1+ specification (RECOMMENDED).
The capabilities for users are limited by their assigned roles (REQUIRED) and by quotas that cap resource usage (REQUIRED).
Role assignment via groups and user assignment to groups must be supported (REQUIRED).
There must be predefined roles for read-only access (REQUIRED). (See also chapter 7.6)
API calls use tokens or client certificates with a limited lifetime of at most one day (REQUIRED); shorter lifetimes are recommended.
It must be possible to revoke tokens/certs (REQUIRED).
Compute resources are split into several (ideally three or more) availability zones. These zones have a meaningful independence from each other on the hardware level in terms of power supply, backup power, cooling, fire protection, core routers, network connectivity, etc., so that the simultaneous failure of more than one availability zone is significantly less likely than the failure of a single availability zone (RECOMMENDED).
The zones need to be connected with high-bandwidth low-latency networks (REQUIRED).
The details of connectivity and independence may vary but need to be made transparent (REQUIRED).
This allows users to create services that survive common hardware outage scenarios. The storage and especially the network resources are best designed to work across the availability zones and survive the failure of one (by using redundancy) (RECOMMENDED);
if they are not resilient to availability zone failure, they must also be provided separately per availability zone (CONDITIONALLY REQUIRED).
Providers may offer multiple regions (OPTIONAL).
Regions may merely support federation, as with other compatible providers; they may offer additional synchronization for user convenience (OPTIONAL).
Regions must not depend on anything from another region to be fully functional (REQUIRED).
The web interface should be configured to require two-factor authentication (RECOMMENDED).
There is a Self-service function to create Container Orchestration Clusters within customer projects on-demand (REQUIRED).
The offered APIs are discoverable (REQUIRED).
The size of these clusters (how many machines, and of what size) can be chosen by users (REQUIRED).
It must be possible to distribute clusters over several physical machines to ensure resiliency against outage of single physical machines (REQUIRED).
The clusters can be enlarged or reduced in size on the fly (based on the user configuration) without disruption (as long as users don’t request to go below a minimal size) (RECOMMENDED).
The container orchestration software can be updated without disruption to either the user workload or the control plane (REQUIRED). (This may again be subject to minimal size requirements; a single-node cluster obviously cannot do rolling upgrades.)
There must be a documented way to get a metrics service deployed that allows observing the load on the system (REQUIRED).
It is recommended that this is enabled by default, i.e. implemented as opt-out (RECOMMENDED).
There must be clear responsibilities for providing security upgrades to the orchestration software (REQUIRED).
Optionally, these security upgrades are installed by the provider (OPTIONAL).
The container clusters must provide access to persistent storage (REQUIRED).
The container clusters must provide ways to control securely isolated networking between containers (REQUIRED).
The container clusters must provide ways to expose services to outside users (REQUIRED).
The container orchestration layer must allow expressing scheduling preferences to mandate or prevent containers/pods being scheduled on the same node, and it must observe these preferences (REQUIRED).
Container cluster creators can create roles with limited capabilities inside their container clusters to implement a least-privilege approach (REQUIRED).
The management of the container clusters is done via REST APIs (REQUIRED).
The management of the workload containers in the cluster is done via REST APIs (REQUIRED).
The APIs follow the reconciliation loop paradigm, where the user submits the wanted infrastructure state in a hierarchical data structure (typically JSON or YAML) and the container orchestrator then repeatedly attempts to ensure that reality matches the wanted state (REQUIRED).
The API is extensible, providing a mechanism by which schemas for custom resources can be supplied and then used to validate custom resource creation requests (REQUIRED).
The customer can provide the needed components (containers, hooks, etc.) to implement the same automation for custom resources that the container platform provides for its native resources (REQUIRED). Side note: the reconciliation loop paradigm for custom resources is implemented by so-called operators.
The platform should offer federatable identity management that can be used by customers for controlling access to the Virtualization Platform and to the Container Platform.
The platform should allow for customer-controlled user federation from external Identity Providers via industry-standard protocols such as OpenID Connect (RECOMMENDED).
The Identity Provider may be another compatible cloud (RECOMMENDED).
The multi-tenancy supports two layers (RECOMMENDED).
Virtual resources belong to a project and only users that have roles with access to this project can see and manage them (REQUIRED).
By default, resources from different projects are isolated from each other (REQUIRED).
User and role management is possible in a self-service manner with isolated domains/realms (RECOMMENDED).
The container management platform allows federating users from external identity providers using OpenID Connect (and possibly other established standards), also allowing the same identities that are used on the virtualization layer to be leveraged (if so wanted by the user) (REQUIRED).
The platform comes with a standard set of operational tooling that eases the operation of the platform:
Lifecycle management: Tooling to deploy, remove, change and upgrade the software components at all layers (REQUIRED).
It is crucial that the rollout of an updated software component is a normal and regular activity with minimal customer impact, ensuring that security updates can be deployed short-term using regular processes (REQUIRED).
Observability: Tooling to permanently collect internal information from the infrastructure. Ability to configure alerts for irregularities (errors) and thresholds (e.g. capacity limits) (REQUIRED).
Health monitoring complements this by running scenario tests from a user perspective (REQUIRED). This allows observing user-visible platform performance and monitoring errors, response times, and other benchmarks as observed by users.
Logging and auditing: Important events, especially security-relevant ones are recorded and collected on protected log aggregation infrastructure for inspection and analysis (REQUIRED).
The logs are automatically analyzed for anomalies (RECOMMENDED).
If this is not the case, they must be inspected manually on a regular basis (CONDITIONALLY REQUIRED).
Metering: Usage information is collected in a traceable way in order to be able to create billing information. (REQUIRED for public clouds, obviously, but still RECOMMENDED for others)
Resource Management: In case of hardware or software errors, the cloud-/container platform may end up with resources that are still allocated but no longer in use. These need to be detected and be handled by the Operations team (REQUIRED).
It is recommended to automate the handling for common cases (RECOMMENDED).
For public clouds, it must be ensured that broken resources are not billed to the customer (REQUIRED for public clouds).
The platform offers a container registry where users can manage and analyze container (and optionally also VM) images (RECOMMENDED).
The platform offers a status page that the provider uses to communicate the availability and any deviations from it to users (REQUIRED).
This section links to any external documents that may be relevant, such as standards documents or other descriptions of this Building Block that may be useful.
Digitale Souveränität und Künstliche Intelligenz – Voraussetzungen, Verantwortlichkeiten und Handlungsempfehlungen des Digital Gipfel 2018
Include a wider range of internationally recognized frameworks, e.g. European Union Agency for Cybersecurity (ENISA) guidelines, GDPR for the EU, and other regional standards such as Singapore’s Personal Data Protection Act (PDPA)
Extend guidance on scalability taking into consideration diverse needs of nations with varying geographic and demographic contexts (e.g. geo-distributed architectures and multi-region deployments) and edge computing for improving service delivery in remote or underserved areas
Investigate where zero trust security should be included
Add requirements on data residency and thereby addressing the need for critical data to be stored within national borders or in compliant jurisdictions
Highlight the importance and impact of Service Level Agreements (SLA) and multi-vendor strategies to meet stringent uptime and redundancy requirements
Include strategies for implementing FinOps and provide guidelines on designing cost-effective, sustainable cloud infrastructures that align with global sustainability goals
This section provides context for this Building Block.
Use-Case for Cloud Infrastructure: Digital services require software to be deployed on hardware. Modern practices allow for fully automated, software-defined deployments across development, testing, and production environments. The infrastructure Building Block specification supports this by enabling software-defined specification and connection of virtual resources as needed. Solutions which adhere to this Building Block should achieve these goals.
Mature open-source tools are available to build and manage flexible cloud and container infrastructures. With the right guidelines, standardization, hardware, and an operations team, this setup can run in any data center using a common server stack. It can also be offered by providers as a public cloud service. This approach ensures control, transparency, and strong digital sovereignty.
States, societies, companies, and individuals have much to gain by leveraging modern digital cloud-based technology. Yet embracing technology without oversight over dependencies can easily result in situations where digital infrastructure is at the mercy of large foreign entities, without any substantial possibility to exercise one's own control over it, to reflect one's own values, or to assert one's own regulation over it. Vendor lock-in is the opposite of digital sovereignty. Governments need to have the choice and ability to design and control their IT environments and cloud technology to ensure security, independence, and data sovereignty for their citizens. See chapter 5 for considerations on digital sovereignty.
While many organizations have benefitted from gradually adapting their workloads to take advantage of the automation possibilities of cloud computing, the IT industry is witnessing a new generation of workloads that has been designed from the ground up to take full advantage of the possibilities of cloud infrastructure: auto-scaling stateless services on demand to the current level of load and automating many of the operational tasks that would otherwise be done manually by operations teams. These workloads are called cloud-native. While the first wave of these was based on virtual machines (VMs), we see a second, larger wave that leverages container technologies. In many cases, these containers run on top of virtual machines, balancing good developer abstractions and fast scalability (where container technologies excel) with flexibility and isolation requirements (the strength of virtualization technology). In dedicated environments, however, it can be beneficial to cut out the complexity of a virtualization layer and run containers on bare metal.
In all cases, users of the technology should consider the dependencies on providers of technology and infrastructure and take deliberate decisions on all components of the technology stack required to develop and run their workloads.
Key Digital Functionalities describe the core (required) functions that this Building Block must be able to perform.
After physical hardware has been deployed in the data center and added to an asset management system, it needs to be virtualized, so it can be sliced up into small pieces and composed on demand into securely isolated small, medium, and big chunks of capacity in a way consumable by customers. Compute capacity (the CPUs and increasingly also the GPUs and main memory) is virtualized by using hypervisor software (supported by commodity hardware technologies), which allows users to use fractions or large parts of a system as needed, securely isolated from other fractions.
Storage is virtualized by having multi-tenant storage systems which emulate virtual disks (block storage) and provide object storage capabilities. Using replication and error correction codes makes large storage systems resilient against failure of single storage media or single storage systems. Virtualized storage opens the possibility of distributed storage solutions.
Network is virtualized by giving users the ability to create networks on the fly which are securely segregated from other users’ networks, without the need to re-cable the data center. Good virtualized networks are also resilient against the failure of single network interface cards, cables, or physical switches.
Combining these three basic virtualized capabilities – compute, network, storage – allows users to design their virtual data centers and roll out software and run workloads just like on purpose-built hardware setups. With standardized APIs, a virtual setup can be built in minutes that would take months to procure and build on physical hardware. The full automation and the quick creation of virtual resources allow the virtual hardware setup to be considered a fluid resource; scalable workloads can adjust the allocation of virtual resources to the capacity needs by the minute. This is what is meant by rapid elasticity. Virtual hardware has a few extra tricks available – memory, virtual disks, and network interfaces may be hot-plugged (and hot-unplugged when no longer in use) while the workload is running in a virtual machine. A virtual machine running on hardware that needs hardware maintenance (e.g. for exchanging a faulty RAM module or for installing a security update to the firmware or hypervisor) may be live-migrated to another hardware system without notable interruption. This way, hardware maintenance without disruption to user workloads is possible in a cloud environment. The configuration for these workloads can be stored in version control systems and subjected to the same reviews and testing rigor as code – the virtual infrastructure is managed like code – Infrastructure as Code.
There are mature open source technologies for the virtualization of compute, network, and storage, as well as for the virtualization management layer that abstracts these further by offering standardized APIs for consuming them. The virtualization management layer, the cloud management system, allows many users to collaborate in common projects within a domain, while being shielded from other projects or users in other domains. It thus creates a secure multi-tenant platform, tracking all these virtual resources (and their assignment to physical resources), using hardware capabilities and virtualization technology to securely isolate the virtual resources from each other, while offering the self-service APIs that allow users to, for example, plug virtual network cables into virtual switches to connect a virtual machine to a virtual network or move a public internet-exposed IP address from one VM to another.
Using these APIs and readily available open source software, higher level abstractions can be built, e.g. for backup services, virtual load balancers, DNS service or secret stores and exposed via standardized APIs again. The mentioned examples should be considered standard features of a capable cloud management platform.
Cloud platforms that are typically used for building private clouds often allow for a huge amount of configurability. While this allows for highly specialized clouds, it comes with the downside of destroying the network effects of highly standardized platforms that have made the public cloud platforms so successful, facilitating the emergence of large ecosystems of applications. Standardizing the APIs and system behavior of cloud platforms (without necessarily prescribing many of the implementation details) and codifying best practices for these is thus a prerequisite for creating platforms that support self-reinforcing ecosystems. Where differences do provide meaningful differentiation for providers, standardized ways to make these discoverable are required.
Virtualization technology emulates real hardware, with the advantage that you can run your own operating system (developed to run on physical hardware) inside a virtual machine. While this gives a lot of flexibility, it also creates some overhead. The operating system needs to boot to initialize itself, and the emulated hardware may not be the best level of abstraction for the individual services and components that comprise a workload. Taking the speed of elasticity to the next level, containers are being used. These are launched like normal processes in an already running operating system, just with their individual view of the file system (and thus e.g. the set of libraries being available), network connections, and CPU and memory allocation. Elasticity at the scale of seconds and below is possible using container technology.
While the level of isolation between containers is typically not considered high enough to support scenarios where containers from potentially adversarial parties run on the same node, the isolation is certainly good enough to cleanly separate the various components of workloads run by one entity. The lack of interference has made container building a standard way of distributing complex software, allowing difficult-to-manage dependencies on system libraries and requirements for specific versions of supporting software to be overcome.
To take advantage of the high flexibility of creating, changing and deleting containers on the fly, new abstractions to manage a fleet of containers that comprise a workload have been developed. Containers that belong together can be grouped and scaled as groups. Policies with respect to the network connections may be enforced centrally. Memory or CPU limits may be imposed. Containers may be replicated over several nodes to ensure resilience against failure.
Modern container orchestration technology is characterized by using a declarative way to describe the configuration of containers and their network and storage setups. Using sufficiently powerful abstractions in the declarative description of the desired state allows the container orchestrator’s reconciliation loop to continuously take actions as needed to adjust the reality to the desired state. This takes the burden off the teams that operate workloads to react to unexpected events (e.g. an application crash due to overload), as the container management system can be instructed to restart automatically (and to avoid the problem in the first place by using autoscaling on the service container before it becomes overloaded).
The container orchestration solution knows what actions it needs to take to reconcile reality with the desired state (e.g. start a container if there should be one running but there is none). Building upon this powerful concept, so-called operators have emerged as a technology to reconcile reality with the desired state for more complex, custom-defined resources, such as a database service.
Containers are managed inside a set of nodes that belong together – a container cluster. In a virtualized environment, these nodes are typically virtual machines. To work well, these clusters should take advantage of the underlying cloud infrastructure, e.g. by using the network management capabilities, accessing and managing persistent storage, and ensuring that nodes use optimized node images and run on different physical hardware. This is the job of the container cluster management solution. With it, users can not just automate the deployment of workloads to an existing container cluster; they can create, scale, upgrade, change, and remove clusters on the fly (on the scale of minutes). This way, container clusters can become an elastic resource, to be created on demand for development, test, reference, or production usage. All of this can of course be done in automated, API-driven approaches. DevOps teams often prefer to keep all of their work inside the git version control system, using the same review and test processes for all of their work. While some git commits trigger code to be compiled and tested, others may cause test infrastructure to spin up and perform testing of an infrastructure change. This modern approach to managing virtualized and/or containerized workloads is called GitOps. The declarative approach to container management makes it particularly suitable for this.
All the fluidity of code and infrastructure may seem to create instability. And while it is true that changes can happen a lot faster across many more layers of a production environment in a container orchestration environment, good engineering teams have built practices that leverage the automation and flexibility to impose rigorous testing practices. If an infrastructure change can be tested just like a software code change can, it will be. Test environments can be created and deleted on the fly; it is much more practical in such environments to have test environments that closely resemble the real production environment. Except for using mock data and some artificial automated tests, it is best practice to use exactly the same code with very similar configuration for integration testing as is used for the production rollout. This can happen on every proposed code change – an approach called continuous integration (testing). It is very common in containerized application development.
All these advantages have made containerized workloads the standard choice for most newly developed modern applications. This is why many people equate the term “cloud native” with containerized micro-service architectures.
Complex technology comes with weaknesses. Whether these are conceptual weaknesses, software bugs, hardware bugs, or just limitations that are not well documented, they put the usefulness of IT solutions at risk. Worse, a significant fraction of these issues may be abused by actors to gain unauthorized access to data or even control over other parties’ IT systems. These actors may range from curious (and otherwise harmless) engineers or researchers, through cybercriminals that want to extort money, to intelligence services or military units of nation states. Having several layers of protection against such actors and automated or trained reactions in case of breaches is a baseline security requirement for every IT solution outside of highly segregated environments.
Some best practices need to be built into the operational processes and the architecture of cloud and container solutions to be resilient:
(1) A clear distinction between different roles and the corresponding authorizations; a well-documented way to manage these authorizations, and strong authentication mechanisms.
(2) Well-trained staff that understands the risks of falling prey to social engineering attacks and with sufficient staffing that allows people to think twice and to follow four-eyes principles for administrative operations with elevated privileges.
(3) A set of rules, permissions and processes that allows people to do their work without violating the rules and that is thus followed and enforced in real-life.
(4) Using well-defined and limited interfaces with good abstractions that reduce the attack surface (for example the virtualization abstraction, where a well-understood hardware interface is used for isolation).
(5) A learning culture that avoids shaming people for making mistakes but focuses on analyzing errors, making issues transparent, and providing a learning experience beyond just the involved individuals, creating processes and automation that ensure similar mistakes cannot happen again.
(6) A collaboration culture that encourages reviews of each other’s work, where questioning the design and implementation of a solution is considered valuable feedback.
(7) An approach of least privileges; each actor (human or machine) only has the privileges it needs to do its job. This also means that actors that need lots of different privileges should probably be split into several actors. This may mean that human beings (often operators) may be assigned several roles, but can only perform actions with one of them at a time.
(8) A system of defense in depth, where the elevation of privileges on one system tends to be contained to affect this one system only.
(9) Using encryption (with published, well-understood, state-of-the-art algorithms) for data at rest and in transit to ensure that eavesdropping does not compromise the confidentiality of data.
(10) A clear understanding that customer input may never be trusted and always needs to be validated / sanitized when being processed.
(11) The usage of secure programming languages and/or tools that scan for typical programming mistakes.
(12) Using penetration testers to find weaknesses in a system.
(13) Offering a way for security researchers to report findings and reward them.
(14) Creating a security team and providing contact information for outsiders to contact it.
(15) Evaluating published security reports and being connected to relevant pre-disclosure security channels to receive advance warnings.
(16) Publishing security advisories and providing and/or deploying security fixes short-term. This requires running reference environments such that security issues can be reproduced, fixes be validated and then deployed to production with confidence without long lead times.
(17) Adhering to relevant security standards and certifying compliance against them.
State of the art identity and access management systems allow user administrators to assign authorizations to users by assigning them to a group that reflects their job assignment. Experts inside the organization that understand how cloud infrastructure is being employed inside a project can then do a fine-grained setting for that group, deciding to what virtual environments and resources full access and read-only access is required.
In a federated scenario, authentication and group assignment are typically done within a customer Identity Management system; the cloud’s Identity and Access management system then needs to support the federation with the customer’s system. Ideally this can be configured by the customer itself. The IAM system in the specific cloud then needs to be configured to map the groups to specific authorizations. This is also under the customer’s control.
It is recommended to keep the authorization primitives simple, as in “group G has the right to fully control resources of type T in project P” or “group H has the right to have read-only access to resources of type U in project Q”. This makes audits possible and allows for reasoning about security properties. More complex systems may allow for more flexibility, but easily end up being not fully understood by users and operators, provoking misuse that can easily result in security vulnerabilities.
Identity and Access Management (IAM) needs to support federation; allowing to keep authentication and group assignment in a corporate directory or use the system from another cloud that uses the same standards. It needs to allow for a high degree of self-service. It needs to cover access decisions to both the virtualization and the container layer (if both are offered by an infrastructure offering). It needs to provide a high level of availability.
Highly distributed dynamic systems such as cloud and container platforms have a high tendency for complex patterns of unwanted behavior. With increasing maturity of the management software, this may become less frequent, but it is absolutely best practice to closely monitor the availability of all services on a platform. Beyond just the monitoring of service availability (“does the service respond when I connect to it?”), it is recommended to do scenario testing, where a model workload is being automatically deployed into the cloud and/or container infrastructure and tests are done to measure the workload’s functioning. Recording success rates and also performance allows for seeing trends and detecting services that may respond to requests but don’t successfully fulfill them. This is often called health-monitoring.
Alerts can be created and simple, recurring issues can be acted on automatically, whereas other alerts need to reach human beings for further rectification and resolution. Alerts should also be triggered based on trends, where extrapolating e.g. a usage trend can engage capacity management actions such as procuring and deploying additional hardware before capacity runs out.
Health monitoring can best be done from a user’s perspective with normal user’s rights (i.e. without any elevated privileges). Good providers share the results with their customers, providing real-time transparency over the health of the platform.
Health-monitoring from a user perspective (top-down) should be complemented by infrastructure monitoring (bottom-up), where hardware and software failures are reported. While a resilient setup hides most of these failures from becoming user-visible, they do tend to impact the level of redundancy or performance, so they do need to be acted upon, just with significantly lower urgency.
For debugging difficult issues, logs will be collected and need to be aggregated, correlated and indexed for searchability. Logs may also be evaluated in case of security incidents or to audit that system’s compliance status.
The installation of cloud infrastructure can be automated – once all the hardware setup is completed and properly recorded in the asset management system. The automation is of course especially useful when setting up test or reference clouds. The freedom to do this regularly offers the ability to test various software states in an automated way, e.g. performing dozens of upgrades (and rollbacks) before touching the production environment. It is recommended to keep the system configuration in a versioned software revision control system such as git.
The process how to enlarge an environment with additional hardware (or virtual hardware when talking about the container layer) needs to be automated and documented. The same holds true for retiring old (or broken) hardware.
During operation, updates and especially bug and security patches may need to be installed rather often; the lifecycle management tools need to facilitate the collection of the current software state and the rollout of a defined newer state. Rollback needs to be possible as well to be able to go back to a known working (though potentially vulnerable) state.
Software occasionally makes larger jumps, adding lots of new functionality, but then also including breaking changes. Good software management minimizes breaking changes and announces these ahead of time via e.g. deprecation notices. Good release notes for major updates help users to assess the impact.
The lifecycle management tools need to support installing major updates. The process documentation should contain information on the expected impact, e.g. downtimes for the control plane or reduced performance. Providers should use these to announce the impact to their customers ahead of time.
The cost of running the platform needs to be allocated according to usage; what is a requirement for a public cloud still makes sense for internal platforms to track the needs and usage of various departments (and to encourage avoiding unnecessarily large consumption of resources and energy).
The cloud and container platforms need to support recording the usage; this requires at least the recording of creation and deletion events for the metered resources. As events can get lost, the metering solution should compare the state derived from the state machine fed from the events with the status that the platform reports – if an event that a VM has been created was received, but no deletion event, then the state machine would predict that the VM must exist, which can be validated. Errors must be flagged, so operators can investigate and erroneous invoices can be avoided. With such a validation logic, an event based system is preferable over a pure polling system due to much higher efficiency and better accuracy.
Couture, Stephane, and Sophie Toupin. 2019. “What Does the Notion of ‘Sovereignty’ Mean When Referring to the Digital?” New Media & Society 21 (10): 2305–22.
Philpott, Daniel. 2020. “Sovereignty.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Fall 2020. Metaphysics Research Lab, Stanford University.
Open Operations Manifesto
Software development and operations have advanced significantly over the last two decades; the creation of customer-focused, cross-functional, fast-moving agile teams has significantly enhanced the speed of innovation and – if implemented right – also the quality of the delivered service. To leverage their full potential, such teams need access to pooled infrastructure that can be consumed on demand via the network with self-service APIs that allow full automation and rapid elasticity in a pay-per-use model. The established definition of cloud computing reflects this. The empowering effect and the process speed-up gained by allowing developers, testers, and operators to create and connect the needed infrastructure via self-service in a fully automated way controlled by software cannot be overestimated.
In the case of the Hosting Infrastructure specifications, we do not highlight important requirements or describe any additional cross-cutting requirements beyond the general GovStack cross-cutting requirements.
In this section, security approaches of cloud computing environments are described in addition to the general GovStack security requirements.
In this section, Identity and Access Management approaches of cloud computing environments are described in addition to the creation, verification, and authentication of identities.
0.8
Kurt Garloff, Friederike Zelke, Maximilian Wolfs, Artem Goncharov
Key Functionality; Functional Requirements (ch. 6); Data structures (ch. 7) and APIs (ch. 8); Internal workflows (ch. 9) and architecture block diagram; use short data structure for server creation as example; Review and small corrections in chapters 6, 7, 8
0.9
Kurt Garloff
Add AZs and Regions. Token revocation. Floating IPs. Move IAM topics into an own section in chapter 6. Better describe flavors. Move user-data support and consumption into chapter 6. Require log analysis. Mention possible orchestration services with declarative API for IaaS. Scheduling (anti-)affinity for containers. Added optional billing and audit exposure for customers. Add Reference to OpenID Connect specs. Add example data structure for container deployment. Add example OpenAPI spec for status page.
1.0
Kurt Garloff, Nico Lück
Formatting changes; Incorporation of feedback from Kristo Vaher, Ott Sarv, Sreepathy H Vardarajan to specific requirements in chapter 4-6 and for future considerations (v1.X and v2)
This section provides a detailed view of how this Building Block will interact with other Building Blocks to support common use cases.
The main interaction with other building blocks (BBs) is that other BBs run on the virtualized or containerized infrastructure provided by this BB.
The interaction is typically as follows:
A virtualized application will come with OpenTofu configurations, Ansible playbooks, or Pulumi recipes to create the needed infrastructure (a minimal sketch follows after this list).
Some parameters may need to be adjusted to fit the specific infrastructure (standardization helps here).
The Infra-as-Code tool applies the steps described in the recipe, observing the dependencies and using the API calls to create the wanted resources.
Configuration changes can be applied on top, leading to incremental changes (though, depending on the tool, sometimes to surprising resource recreations caused by the way the tool works).
The task for the infrastructure in all this is really to carefully act as instructed and report status back.
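The following is a minimal sketch of such a recipe as an Ansible playbook, assuming an OpenStack-compatible API and the openstack.cloud collection; the cloud entry, names, image, and sizes are illustrative assumptions, not prescribed by this specification.

```yaml
# Minimal sketch: create one VM via Infrastructure as Code (Ansible).
# Assumes an OpenStack-compatible cloud configured as "mycloud" in clouds.yaml
# and the openstack.cloud collection installed; all names are placeholders.
- name: Create a single application VM
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Boot a VM from a 10 GiB volume based on a standard image
      openstack.cloud.server:
        cloud: mycloud
        name: app-vm-1
        state: present
        image: "Ubuntu 24.04"            # provider-supplied standard image (assumed name)
        flavor: SCS-2V-4                 # flavor name, illustrative
        key_name: app-keypair
        network: app-net
        security_groups: [app-secgroup]
        boot_from_volume: true
        volume_size: 10
```

Running such a playbook again is idempotent: the tool compares the declared state with what exists and only acts on differences.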
For containerized applications, the workflow typically is as follows:
A container cluster gets created via the appropriate API calls (if there is none already).
The application will come with a deployment file (YAML) or will use some templating engine that processes the files before submitting them to the container orchestration API.
The services will create the needed resources – typically downloading the required container images (from the internet or the provided registry) and starting them. Typically, no dependency handling is done, as cloud-native applications should be programmed to simply retry again and again until all prerequisites are available, making them resilient against sequencing variance in deployment but also in recovery scenarios.
Here again, the role of the infrastructure is to accept the requests, validate them and – if validation succeeds – execute them.
Due to the asynchronous nature of reconciliation loops and retried connections, the final result often needs to be retrieved by looking at the pod and service statuses or the log files.
Steps and processes involved in creating and configuring a new virtual machine, from resource allocation to OS installation and network setup.
A VM typically needs some prerequisites: A flavor (VM size) needs to be chosen, an image to boot from, a network to connect to at boot time (later connections are possible as well), possibly a security group setting to control the network access to the machine (can be done later as well), an ssh key to inject (on Linux VMs) and possibly storage to attach at boot time (can also be done later, except if there is no disk to boot from).
A typical workflow for creating a single VM is:
Choose name
Choose VM size/properties (flavor)
Choose image to boot
Choose volume size for the boot disk (unless a flavor is used that already has a fixed size local disk attached)
Choose one (or multiple) network(s) to connect to or create them via API calls or GUI, optionally connect the network to a router (API or GUI)
Choose an existing security group or create one (via API calls or GUI) to control network access to the machine’s network ports
Choose or create an ssh keypair
Use API call (or GUI) to create the VM
See chapter 7.1 for an example.
The API calls can be done via a command-line interface tool or a programming/scripting language. Most people prefer IaC tools, at least once more than one VM is involved.
Detailed sequence of actions for deploying an application using container technologies, including image fetching, container orchestration, and network setup.
A typical workflow would look like this:
Ensure there is a container cluster
Adjust parameters in template parameter file
Render the template into the customized deployment file and submit it to the container orchestration engine’s API
Watch the creation of pods and services and look at the logs in case things go wrong
Container orchestration tends to be built in a way that components simply try to connect to the resources they need on the fly and retry if they fail. This avoids the need for dependency handling in many cases and also makes solutions robust against services that need to be restarted. On the flip side, many tools don’t even have a notion of dependency handling.
Procedures for dynamically scaling up or down the cloud resources based on current load and performance metrics, ensuring efficient resource utilization.
The virtualization layer may offer an orchestration service that allows dynamic creation of VMs behind a load balancer based on time or load criteria (OPTIONAL).
The container orchestration layer supports autoscaling – additional pods (replicas) get deployed based on the load (a sketch of such a declaration follows below). Alternatively, additional CPU or RAM resources can be automatically assigned to pods under high load (REQUIRED).
The container cluster itself may not have enough capacity, so the cluster size needs to be increased; it is recommended to provide the mechanisms to enable cluster autoscaling (RECOMMENDED).
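As a minimal sketch of pod-level autoscaling, assuming a Kubernetes-style orchestrator and an existing Deployment named web; the replica bounds and the 70% CPU target are illustrative assumptions.

```yaml
# Minimal sketch: scale the "web" Deployment between 2 and 10 replicas
# based on average CPU utilization (assumed target of 70%).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```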
Processes for integrating the cloud infrastructure with external identity and access management systems to enable seamless user authentication and authorization across services.
The IAM component in the infrastructure needs to provide APIs (and a GUI) that allow users with the appropriate privileges to create users, groups, roles and assign them to authorize them to access virtualization and/or containerization infrastructure. This user management may instead be performed on a separate Identity Provider that is then consumed by the cloud infrastructure.
Typical workflow:
Create user
Assign user to one or several existing groups
Review roles that are assigned to the groups to understand the authorizations that come with the groups
Optionally create new roles and assign them to new groups
Optionally create new projects and add authorizations for the project to existing or new roles
Ask the user to authenticate and test whether she can access what’s required
This section provides a reference for APIs that should be implemented by this Building Block.
The same comment as in section 7 applies. We refrain from defining Meta-APIs here, but may agree to do so at a later stage. The implementations should use standard APIs that are well supported by IaC tooling.
Interface for creating, managing, and deallocating virtual machines on the cloud infrastructure.
The API is a REST interface; a resource gets created by an authorized POST request to the relevant endpoint (as listed in the catalogue), including the JSON data structure describing the resource properties. The methods GET, DELETE, PUT, and PATCH are supported to list or retrieve details, to destroy a resource, and to apply changes to an existing resource. The normal HTTP response codes indicate whether the request was successful. The authentication happens via a header with the authorization token or via a client certificate.
Note that resource creation can be a long-running process. If the request was valid, a success code (200) and a message with the resource UUID are returned, but polling for the status of the resource is needed to ensure it is ready before relying on its readiness. Details are implementation-dependent; a defensive automation approach would always check the state of a resource.
The interface is an imperative one – the API call requests the platform to do X. The platform attempts to do so, and if there is a failure, it will be reported. The resource creation may succeed and then later enter into a failure state. This won’t be automatically fixed either but can be seen from the state.
Orchestration services with a declarative interface that describe the wanted target state that the orchestrator then tries to create are possible.
Endpoints for creating, attaching, resizing, and deleting storage volumes used by virtual machines and containers.
The same considerations apply as in 8.1.
Services for setting up, modifying, and tearing down network configurations, including virtual networks, subnets, and security groups.
The same considerations as in 8.1 apply.
Interfaces for deploying, scaling, and managing containerized applications, supporting operations like starting, stopping, and monitoring containers.
The REST API submits hierarchical data structures (typically as YAML) that describe a wanted state, like “there should be two replicas of this pod in the cluster”. (A pod is a set of containers that belong together and are always scheduled together.) The reconciliation loop of the container orchestration software is supposed to ensure that those two replicas get started. If a node dies, it will soon take care of starting another replica to ensure we (almost) always have the desired state.
The data structures are defined by the implementation and can be extended via schemas for custom resources.
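As a minimal sketch of such a declarative request body, assuming a Kubernetes-style API (the names, labels, and image location are placeholders):

```yaml
# Minimal sketch: declare that two replicas of a pod should run in the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.org/web:1.0   # placeholder image location
          ports:
            - containerPort: 8080
```

The reconciliation loop then keeps two replicas running, restarting or rescheduling pods as containers crash or nodes fail.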
Endpoints for managing user identities, roles, permissions, and authentication policies.
Both the virtualization layer and the container orchestration layer allow for defining users, groups, and roles locally. To allow for federated infrastructure, however, it is desirable to use federated identities and to configure the users in the IAM component or the customer’s own Identity Provider.
Services for collecting, querying, and analyzing logs and monitoring data from cloud resources and applications.
Implementation specific.
Interface for retrieving the current status and health metrics of various cloud services and components, enabling real-time monitoring and transparency.
Infrastructure providers should publish the status of their services to the users, allowing them to determine whether service limitations are happening in the provider’s realm. Providers use an API to publish their status, which is then displayed on a web page accessible to at least all platform users.
An example API specification would look like this:
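The following is a minimal, illustrative OpenAPI 3.1 sketch; the endpoint paths, schema fields, and status values are assumptions rather than prescriptions.

```yaml
openapi: 3.1.0
info:
  title: Platform Status API (illustrative sketch)
  version: 1.0.0
paths:
  /v1/components:
    get:
      summary: List platform components and their current status
      responses:
        "200":
          description: A list of components with their status
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: "#/components/schemas/Component"
components:
  schemas:
    Component:
      type: object
      properties:
        id:
          type: string
        name:
          type: string
        status:
          type: string
          enum: [operational, degraded, outage, maintenance]
        updated_at:
          type: string
          format: date-time
```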
This section provides information on the core data structures/data models that are used by this Building Block.
Prescribing a meta-API of our own would happen in competition with existing efforts and might suffer from low adoption with poor tooling coverage. This may be a worthwhile initiative on its own as soon as the demand from stakeholders and their willingness to adopt the results does exist.
We shall describe the high-level characteristics of the data structures (this chapter) and APIs (next chapter) to give guidance for the requirements and design of good data structures and APIs. It will be the role of the chosen technology for implementation to define the detailed data structures and APIs. The description shall not be reproduced in a GovStack document – instead, the upstream technology description shall be referenced and only additional requirements, enhancements or restrictions be noted here.
Guidelines and protocols for implementing infrastructure as a service (IaaS) and container orchestration layers to ensure interoperability and standardization.
The resources (virtual machines, networks, containers, …) are uniquely identified by UUIDs, assigned by the platform upon creation of the resources. Resources typically also have names that allow human operators to remember which resources serve which purpose. Platforms may or may not enforce the uniqueness of names – referencing resources by UUID is thus preferred.
The data structures are a hierarchical model of the (virtual/containerized) resources. They are typically encoded in JSON or YAML.
We shall reproduce one data structure (in JSON) describing the data structure needed to create a virtual machine here as an example:
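The following sketch is written in the style of an OpenStack-like compute API (one possible implementation), consistent with the field names explained below; apart from the image UUID referenced in the explanation, all identifiers and names are placeholders.

```json
{
  "server": {
    "name": "my-first-vm",
    "flavorRef": "SCS-2V-4",
    "key_name": "my-keypair",
    "min_count": 1,
    "max_count": 1,
    "networks": [
      { "uuid": "d9a38e95-2c1a-4f01-9a4d-1f52e4a5c001" }
    ],
    "security_groups": [
      { "name": "my-secgroup" }
    ],
    "block_device_mapping_v2": [
      {
        "boot_index": 0,
        "source_type": "image",
        "uuid": "0b28a60d-6b2e-45d2-b366-df2dd544c63b",
        "destination_type": "volume",
        "volume_size": 10,
        "delete_on_termination": true
      }
    ]
  }
}
```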
The block_device_mapping_v2 entry tells the cloud orchestrator to create a virtual disk (“volume”) of 10 GiB that should be initialized with the contents of the image identified by UUID 0b28a60d-6b2e-45d2-b366-df2dd544c63b. The other fields describe the name of the VM, its size (flavorRef), the injected ssh key, the number of VMs to be created, the network to attach to, the security group to use, and some more details on the volume, such as size and behavior upon VM deletion.
The data structures to describe the wanted state for a container workload contain the containers, their networking and storage properties etc. An example may look like this:
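As a minimal sketch, assuming a Kubernetes-style container orchestration layer; the image location, port, and probe path are placeholders.

```yaml
# Minimal sketch: one pod with a single container, resource requests/limits,
# readiness/liveness probes and a 10 GiB ReadWriteOnce volume mounted at /data.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: registry.example.org/app:1.2.3   # placeholder image location
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
      volumeMounts:
        - name: data
          mountPath: /data
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data
```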
This example creates a pod with a single container (to be downloaded from the location specified in the image setting), with certain CPU and memory allocations and limits, a persistent volume with ReadWriteOnce properties and 10 GiB size mounted at the /data path, and settings for the orchestrator to probe the container for readiness and healthiness.
Submitting this declaration to the container layer will make it create the resources (a container and a persistent volume) and continuously watch the container’s status.
Specifications for identity and access management and operational tools to maintain security and efficiency.
Data models for tracking the allocation and utilization of cloud resources, such as compute, storage, and network capacities.
Resources are created, changed and deleted upon user requests, within the limits that the authorization (role) and quota for the user allow.
An efficient platform for recording usage does record the creation, change and deletion events. It however needs to monitor for the success of these requests. In addition, there may be non-user-triggered events that change the life-cycle of resources – these need to be recorded as well. The theoretical state of resources can be constructed from these events. It is required to poll the real state regularly to ensure no event was missed and no error condition caused a state change that would lead to wrong assumptions, wrong usage records and thus wrong bills.
A minimalistic example of a usage record of this kind would be:
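The field names and values below are illustrative assumptions (including the timestamp and identifiers), not a prescribed format; the semantics are explained in the following paragraph.

```json
{
  "resource_type": "compute.vm",
  "resource_id": "a78622a8-1177-47af-b5da-3378ee5d4313",
  "event_type": "UserRequest",
  "action": "created",
  "flavor": "SCS-2V-4",
  "timestamp": "2024-07-02T04:23:17Z",
  "consumption": null
}
```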
This VM, identified by its UUID, was created on Jul 2 in the early morning with the size SCS-2V-4 based on a user request. Successful UserRequests for resized, stopped, started, and deleted would result in similar records. OperatorActions and Reconciliation would be other types of events. The regular polling could also create StateChange records of type None if the state is unchanged and as expected, for audit reasons. The Consumption field may be used for resources whose usage is measured, such as e.g. external network traffic (common) or storage data transfer (uncommon).
Note that this is not meant to prescribe a way of creating Usage Records; this may happen in a later revision when the usage records will be exposed to users directly, which is not currently a requirement.
Templates and schemas for configuring virtual machines and containers, including hardware specifications and software dependencies.
Structured formats for storing logs and monitoring metrics to facilitate observability, troubleshooting, and performance tuning.
Not specified here, implementation-dependent.
Providers should allow customer admins to access billing information and logs that allow to understand when a resource was changed by whom (OPTIONAL).
Data structures defining roles, permissions, and access controls for users and services within the cloud infrastructure.
The cloud infrastructure has a number of predefined roles; these roles apply to a scope. Predefined roles and scopes have a hierarchy; a more powerful role encompasses all less powerful roles. The combination of a scope with a role is a persona.
The following roles exist, from most powerful to least powerful:
admin: Cloud operator, omnipotent
manager: Customer role for self-management
member: Customer role to create and manage resources
reader: Customer role with read-only access
The following scopes exist (from largest to smallest):
system: The complete cloud region
domain: Users, roles, and projects within the domain
project: All resources belonging to a particular project
The most commonly used personas are:
system-admin: Omnipotent
domain-manager: Manages users, roles, federation, and projects
project-member: Manages all resources within the project
project-reader: Reviews resource usage within the project
On the container layer, the standard role cluster-admin is predefined; it can manage all aspects of the container cluster. Less powerful roles can be created (a sketch follows below).
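As a minimal sketch of creating such a less powerful role on the container layer, assuming a Kubernetes-style RBAC model; the namespace, group, and role names are illustrative.

```yaml
# Minimal sketch: a namespace-scoped read-only role for pods,
# bound to a (possibly federated) group of users.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: demo
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: demo
  name: read-pods
subjects:
  - kind: Group
    name: project-readers        # e.g. a group mapped from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```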
For the federation, OpenID Connect should be supported; the protocol and data structures are described by the specifications from the OpenID Foundation’s relevant working group.
A precise description of all data structures of a full cloud- and container platform would fill hundreds of pages and also imply specific technologies to implement the cloud- and container orchestration layers. Industry attempts to define technology neutral meta APIs and data structures have thus far had limited traction; the most successful one is probably . Infra-as-Code tools (such as opentofu or ansible) also have their own representation of the resources and their properties – however, they do not abstract away the differences in the object model of different cloud- and container orchestration systems.
User federation should be supported using OpenID Connect. OpenID Connect builds on top of the proven OAuth 2.0 mechanisms. The specifications are available from the OpenID Foundation’s relevant working group.
It is best practice for users not to build highly specialized images for each task a VM might need to perform. This would result in dozens of images for typical workloads, and in case a software update (e.g. a security update for the used operating system) is required, all of these would need to be rebuilt and re-registered. Instead, there is a common mechanism to customize images on (first) boot, using so-called user-data and cloud-init or alternatives (such as cloudbase-init on Windows or coreos-cloudinit for CoreOS). The commands supported by cloud-init are documented in the upstream cloud-init documentation.
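As a minimal sketch of such user-data in the cloud-config format processed by cloud-init (package names, file paths, and contents are illustrative):

```yaml
#cloud-config
# Minimal sketch: customize a standard image on first boot.
package_update: true
packages:
  - nginx
write_files:
  - path: /etc/myapp/role        # illustrative application-specific marker file
    content: |
      frontend
runcmd:
  - systemctl enable --now nginx
```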