Enterprise Container Strategy

Bernard Lin
18 min read · Jul 28, 2018

--

A common strategy is to use CaaS (Container as a Service) model. In a CaaS model, containers and clusters are provided as a service that can be deployed in on-premises data centers or over the cloud.

Areas of consideration:

  1. Architecture
  2. Implementation
  3. Tooling
  4. Governance

Architecture

CaaS — Container as a Service

Conceptual Reference Architecture:

Docker Enterprise:

Hypernetes — Open-source multi-tenant CaaS

Networking

Containers use Linux kernel partitioning capabilities called cgroups and namespaces. Containers are mapped to network, storage and other namespaces. Each namespace “sees” only a subset of OS resources, which guarantees isolation between containers.

A container’s network namespace has its own network stack with interfaces, route tables, sockets and iptables rules. An interface can belong to only one network namespace, so running multiple containers requires multiple interfaces. Another option is to generate pseudo-interfaces and soft-wire them to a real interface (containers can also be mapped to the host network namespace, as is done for daemons).

Options for creating and wiring a pseudo-interface:

Virtual Bridge

Create virtual interface pairs (veth) with one side in the container and the other in the root namespace, and use Linux bridge or OpenvSwitch (OVS) for connectivity between containers and external (real) interfaces. Bridges may introduce some extra overhead when compared to a direct approach.
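As a sketch of this plumbing, the veth-pair-plus-bridge wiring can be reproduced by hand with iproute2 commands (must run as root; the namespace, interface and address names below are illustrative, and container runtimes automate exactly these steps):

```shell
# create a network namespace standing in for a container
ip netns add demo

# create a veth pair; move one end into the "container" namespace
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns demo

# attach the host-side end to a Linux bridge and bring everything up
ip link add name br0 type bridge
ip link set br0 up
ip link set veth-host master br0
ip link set veth-host up

# configure the container-side end inside the namespace
ip netns exec demo ip addr add 10.200.0.2/24 dev veth-ctr
ip netns exec demo ip link set veth-ctr up
```

OVS can be substituted for the Linux bridge in the same topology.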

Multiplexing

Multiplexing can consist of an intermediate network device that exposes multiple virtual interfaces, with packet-forwarding rules to control which interface each packet goes to. MACVLAN, for example, assigns a unique MAC address per virtual interface.

Hardware Switching

Single Root I/O virtualization (SR-IOV) is a way to create multiple virtual devices. Each virtual device presents itself as a separate PCI device. It can have its own VLAN and hardware-enforced QoS association. SR-IOV provides bare-metal performance but is usually not available in the public cloud.

Kubernetes networking implementation options:

Cilium

Cilium is open source software for providing and transparently securing network connectivity between application containers. Cilium is L7/HTTP aware and can enforce network policies on L3-L7 using an identity-based security model that is decoupled from network addressing.

Contiv

Contiv provides configurable networking (native L3 using BGP, overlay using VXLAN, classic L2, or Cisco-SDN/ACI) for various use cases. Contiv is fully open source.

Contrail

Contrail, based on OpenContrail, is a truly open, multi-cloud network virtualization and policy management platform. Contrail / OpenContrail is integrated with various orchestration systems such as Kubernetes, OpenShift, OpenStack and Mesos, and provides different isolation modes for virtual machines, containers/pods and bare metal workloads.

Flannel

Flannel is a very simple overlay network that satisfies the Kubernetes requirements. Many people have reported success with Flannel and Kubernetes.

Nuage

The Nuage platform uses overlays to provide seamless policy-based networking between Kubernetes Pods and non-Kubernetes environments (VMs and bare metal servers). Nuage’s policy abstraction model is designed with applications in mind and makes it easy to declare fine-grained policies for applications. The platform’s real-time analytics engine enables visibility and security monitoring for Kubernetes applications.

Open vSwitch

Open vSwitch (OVS) is a somewhat more mature but also more complicated way to build an overlay network. It is endorsed by several of the “Big Shops” for networking.

Weave

Weave Net creates a virtual network that connects Docker containers across multiple hosts and enables their automatic discovery.

Project Calico

Calico provides a highly scalable networking and network policy solution for connecting Kubernetes pods based on the same IP networking principles as the internet. Calico can be deployed without encapsulation or overlays to provide high-performance, high-scale data center networking. Calico also provides fine-grained, intent-based network security policy for Kubernetes pods via its distributed firewall.

Service Registry & Discovery

Service discovery patterns:

  • Client-side Discovery

With client-side discovery, the application’s clients talk to a service registry where all your application’s endpoints and backends are stored and kept up to date. Clients talk to the registry directly (they are “registry-aware”) and usually perform the load-balancing logic across the list of backends themselves.
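A minimal sketch of the pattern, assuming an in-memory registry and round-robin balancing (the service name and endpoints are made up):

```python
import itertools

# A toy in-memory service registry: service name -> list of backend endpoints.
# In practice this would be Consul, etcd, or ZooKeeper; all names are made up.
REGISTRY = {
    "orders": ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"],
}

class RegistryAwareClient:
    """A registry-aware client that round-robins across a service's backends."""
    def __init__(self, service_name):
        self._backends = itertools.cycle(REGISTRY[service_name])

    def next_backend(self):
        # client-side load balancing: rotate through the registered backends
        return next(self._backends)

client = RegistryAwareClient("orders")
print(client.next_backend())  # 10.0.0.11:8080
print(client.next_backend())  # 10.0.0.12:8080
```

A real registry would also handle registration, health checking, and change notification, which this sketch omits.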

  • Server-side Discovery

With server-side discovery, clients talk to a load balancer and the load balancer talks to the service registry. This removes the need for clients to be registry-aware, and also makes the load balancer a demarcation point where several operations related to security, monitoring, and service delivery can be centralized.

  • Server-side Discovery with DNS

The simplest way to enable server-side service discovery is via the venerable Domain Name System (DNS) that fuels the internet. DNS clients are embedded in all operating systems, so you (or your program) just need to remember the service name, e.g., “server04.databaselayer.mycompany.net”, and a DNS server will provide an address for it.

Drawbacks:

  • DNS adapts very slowly to changes. Microservices and containers are by nature very dynamic, and their locations change often and quickly.
  • A DNS-based solution will very likely support only round-robin load balancing (the DNS server returns the entries it has for a name in round-robin fashion, and in this pattern there is no load balancer involved), whereas a modern microservices implementation typically wants granular load-balancing policies that consider latency, load, number of connections, or other parameters.

  • Server-Side Discovery with a Load Balancer
  • Server-Side Discovery with a Load-Balancing Proxy

A twist and optimization on the previous approach is to co-locate a proxy or load balancer on each server of the cluster where the services are being launched. Airbnb’s SmartStack is an early example of this, and the Kubernetes “kube-proxy” is a more recent one.

  • Server-side discovery with an API Gateway

An API gateway mediates between the clients and the (micro)services behind it, proxying requests to them and possibly providing a tailored API depending on the client type or the service version. In a microservices architecture, where your backend services may be subject to change and be decomposed into smaller services, or where your versions may change quickly, an API gateway lets you maintain a homogeneous interface for clients while still being able to change your backend implementation.
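The routing core of a gateway can be sketched in a few lines; the paths and backend names here are hypothetical, and a real gateway layers authentication, rate limiting, and protocol translation on top:

```python
# Map public path prefixes to whichever backend service (and version)
# currently implements them; clients never see the backend names.
ROUTES = {
    "/api/orders": "http://orders-v2.internal:8080",
    "/api/users": "http://users-v1.internal:8080",
}

def route(path):
    """Return the backend URL for a request path, or None if nothing matches."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return backend + path[len(prefix):]
    return None

print(route("/api/orders/42"))  # http://orders-v2.internal:8080/42
```

Replacing orders-v2 with a newer version changes only the routing table; the client-facing API stays stable.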

Storage

Strategies for managing persistent data storage:

  • Host-Based Persistence
    Host-based persistence is one of the early implementations of data durability in containers, which has matured to support multiple use cases. In this architecture, containers depend on the underlying host for persistence and storage. This option bypasses the specific union filesystem backends to expose the native filesystem of the host. Data stored within the directory is visible inside the container mount namespace. The data is persisted outside of the container, which means it will be available when a container is removed.
  • In host-based persistence, multiple containers can share one or more volumes. When multiple containers write to a single shared volume, data corruption can occur, so developers need to ensure that applications are designed to coordinate writes to shared data stores.
  • Shared Multi-Host Storage
    While the techniques discussed above offer varying levels of persistence and durability, they share one major drawback — they make containers non-portable. The data residing on the host will not move along with the container, which creates a tight bond between the host and the container.

The latest approach, shared multi-host storage, is by far the best approach to shared persistence. A container process running on each storage cluster member is responsible for keeping track of the backend persistent storage (host-level devices as well as referenceable volume names), reports to a key-value store which other nodes are in the storage cluster, and acts as the plugin to the Docker Engine API, keeping track of what containers write across all nodes in the storage cluster. If a node within the storage cluster fails, another node takes over.

Typical use cases:

  • Databases
  • Hot-mounting source code
  • Master-Worker

  • Volume Plugins
    Although host-based persistence is a valuable addition to Docker for specific use cases, it has the significant drawback of limiting the portability of containers to a specific host. It also doesn’t take advantage of specialized storage backends optimized for data-intensive workloads. To solve these limitations, volume plugins have been added to Docker to extend the capabilities of containers to a variety of storage backends, without forcing changes to the application design or deployment architecture.
    Starting with version 1.8, Docker introduced support for third-party volume plugins. Existing tools, including the Docker command-line interface (CLI), Compose and Swarm, work seamlessly with plugins. Developers can even create custom plugins based on Docker’s specifications and guidelines.

As of June 2016, Docker supports over a dozen third-party volume plugins for use with Azure File Storage, Google Compute Engine persistent disks, NetApp Storage and vSphere. In addition, projects like Rancher Convoy can provide access to multiple backends at the same time.
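For illustration, with a third-party plugin installed, a named volume can be created against that driver and attached to a container; the driver and volume names below are hypothetical:

```shell
# create a volume backed by a third-party storage driver
docker volume create --driver my-storage-driver --name dbdata

# run a database against the plugin-backed volume; the data outlives
# the container and, depending on the backend, can follow it across hosts
docker run -d -v dbdata:/var/lib/postgresql/data postgres:9.6
```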

Typical use cases:

  • Data-intensive applications
  • Database migration
  • Stateful application failover
  • Improved Mean Time Between Failures (MTBF)

Software-Defined Storage & Solution Providers

Existing storage technologies, such as network-attached storage (NAS) and storage area network (SAN), are not designed to run containerized applications. Software-defined storage abstracts these traditional types of storage to expose the virtual disks to the more modern applications.

Container-defined storage, a new breed of storage, is a logical evolution of software-defined storage, which is purpose-built to match the simplicity, performance and speed of containers. Container-defined storage runs on commodity hardware, featuring scale-out block storage, which in itself is deployed as a container. It provides per-container storage, distributed file access, unified global namespace, fine-grained access control, and tight integration with the cluster management software.

One of the key advantages of using software-defined storage for containers is the ability to virtualize storage, which may be based on faster solid-state drives (SSD) or magnetic disks. Aggregating disparate storage enables IT to utilize existing storage investments. Some flavours of container-defined storage can automatically place I/O-intensive datasets on faster SSDs while moving the archival data to magnetic disks. This delivers the right level of performance for workloads, such as online transaction processing (OLTP), which demand high input/output operations per second (IOPS).

Many companies are working on integrating software-defined storage with containers, with many of them selling appliances or Storage-as-a-Service offerings. Portworx, Hedvig, Joyent Manta and Blockbridge all provide developers with access to their software without requiring them to buy something else.

StorageOS, Robin Systems and Quobyte are examples of companies that do not provide unbundled access to their software.

It is recommended to consider a software-defined storage approach, as it offers the advantages of portability and of segregating storage management from applications.

Source: Managing Persistence For Docker Containers

Service Mesh

The term service mesh is often used to describe the network of microservices that make up such applications and the interactions between them. As a service mesh grows in size and complexity, it can become harder to understand and manage. Its requirements can include discovery, load balancing, failure recovery, metrics, and monitoring, and often more complex operational requirements such as A/B testing, canary releases, rate limiting, access control, and end-to-end authentication.

Istio is a tool that provides a complete solution to satisfy the diverse requirements of microservice applications by providing behavioural insights and operational control over the service mesh as a whole. It provides a number of key capabilities uniformly across a network of services:

  • Traffic Management: control the flow of traffic and API calls between services, make calls more reliable, and make the network more robust in the face of adverse conditions.
  • Observability: gain an understanding of the dependencies between services and the nature and flow of traffic between them, providing the ability to quickly identify issues.
  • Policy Enforcement: apply organizational policy to the interaction between services, ensuring access policies are enforced and resources are fairly distributed among consumers. Policy changes are made by configuring the mesh, not by changing application code.
  • Service Identity and Security: provide services in the mesh with a verifiable identity and provide the ability to protect service traffic as it flows over networks of varying degrees of trustability.

Istio Architecture:

Implementation

Container build

The assembly line of a container starts with packaging software into container images. A list of things to avoid when building container images:

  1. Don’t store data in containers: a container can be stopped, destroyed, or replaced. An application version 1.0 running in a container should be easily replaced by version 1.1 without any impact or loss of data. For that reason, if there is a need to store data, use volumes. Make sure that applications are designed to write to a shared data store.
  2. Don’t ship your application in two pieces: applications should be part of the image.
  3. Don’t create large images: a large image is harder to distribute. Don’t install unnecessary packages or run “updates” (yum update) that download many files into a new image layer.
  4. Don’t use a single-layer image: make effective use of the layered filesystem — a base image layer for the OS, another layer for the username definition, another for the runtime installation, another for configuration, and finally another for the application.
  5. Don’t create images from running containers: in other terms, don’t use “docker commit” to create an image. This method to create an image is not reproducible and should be completely avoided. Always use a Dockerfile or any other S2I (source-to-image) approach that is totally reproducible, and track changes by storing Dockerfile in a source control repository (git).
  6. Don’t use only the “latest” tag: the latest tag is just like the “SNAPSHOT” for Maven users. Tags are encouraged because of the layered filesystem nature of containers. The “latest” tag should also be avoided when deploying containers in production as you can’t track what version of the image is running.
  7. Don’t run more than one process in a single container: containers are designed to run a single process (HTTP daemon, application server, or database).
  8. Don’t store credentials in the image — use environment variables: you don’t want to hardcode any username/password in your image. Use secrets managed by Docker or Vault.
  9. Don’t run processes as a root user: by default docker containers run as root. As docker matures, more secure default options may become available. Requiring root is dangerous for others and may not be available in all environments. Images should use the USER instruction to specify a non-root user for containers to run as.
  10. Don’t rely on IP addresses: each container has its own internal IP address, which can change if the container is stopped and restarted. If an application or microservice needs to communicate with another container, use environment variables to pass the proper hostname and port from one container to another.
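A minimal Dockerfile sketch that applies several of the rules above — pinned base tag, non-root user, single process, and data and credentials kept out of the image (the image and file names are illustrative):

```dockerfile
# pinned, small base image (rules 3 and 6)
FROM node:8.11-alpine

# run as a non-root user (rule 9)
RUN addgroup -S app && adduser -S app -G app
USER app

WORKDIR /home/app
COPY --chown=app:app . .

# data lives in a volume, not in the container filesystem (rule 1)
VOLUME /home/app/data

# hostnames, ports and credentials arrive via the environment (rules 8 and 10)
ENV DB_HOST="" DB_PORT=""

# one process per container (rule 7)
CMD ["node", "server.js"]
```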

Docker Image Size

The Docker container filesystem uses a copy-on-write technique (see Picture 1). Every RUN instruction in the Dockerfile writes a new layer in the image, and every layer requires extra space on disk. To keep the image size small:

  • Use a smaller base image
  • Don’t install debug tools like vim/curl or other tools that are not necessary
  • Minimize layers
  • Use --no-install-recommends on apt-get install
  • Add rm -rf /var/lib/apt/lists/* to the same layer as the apt-get installs
  • Use a linter and validator such as dockerfilelint
  • Use multi-stage builds
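A sketch of a multi-stage build: the toolchain lives only in the first stage, and the final image carries just the compiled artifact (base images and paths are illustrative):

```dockerfile
# build stage: full Go toolchain
FROM golang:1.10 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# final stage: tiny runtime image containing only the binary
FROM alpine:3.8
COPY --from=build /app /app
CMD ["/app"]
```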

Testing

Traditional testing approach:

Using CI to orchestrate testing.

Disadvantages:

  • Not able to reproduce exactly the same development and testing environment outside the CI.
  • Need to configure the testing tools (correct versions and plugins), configure runtime and OS settings, and get the same versions of test scripts, etc., to create new test environments.

“Test aware container” approach:

This approach proposes an “ONTEST [INSTRUCTION]” directive in the Dockerfile and a “docker test [OPTIONS] IMAGE [COMMAND] [ARG…]” command (neither is part of stock Docker) that would automatically create and run a test version of an image, tagged <image name>:<image tag>-test — executing all build instructions defined after an ONTEST command, plus the ONTEST CMD (or ONTEST ENTRYPOINT). The “docker test” command should return a non-zero exit code if any tests fail, and the test results should be written into an automatically generated VOLUME that points to the /var/tests/results folder.

Example Dockerfile:

FROM <base image>:<version>

WORKDIR <path>

# install packages required to run the app
# (add the app runtime and required packages)
RUN apt-get update && apt-get install -y \
    <app runtime and dependencies> \
    && rm -rf /var/lib/apt/lists/*

# install packages required to run tests
# (add the testing tools and required packages)
ONTEST RUN apt-get update && apt-get install -y \
    <test tools and dependencies> \
    && rm -rf /var/lib/apt/lists/*

# copy app files
COPY app app
COPY run.sh run.sh

# copy test scripts
ONTEST COPY tests tests

# copy "main" test command
ONTEST COPY test.sh test.sh

# auto-generated volume for test results
# ONTEST VOLUME "/var/tests/results"

# ... EXPOSE, RUN, ADD ... for app and test environment

# main app command
CMD ["./run.sh", "<app arguments>"]

# main test command
ONTEST CMD ["/test.sh", "<test arguments>"]

Integration Test Containers:

These are special containers that contain only testing tools and test artifacts: test scripts, test data, test environment configuration, etc. An example construct is illustrated in Picture 4.

Tooling

Secured Image Registry & Security Scanning

Security scanning of Docker images takes an image and cross-references the software it contains against a list of known vulnerabilities to produce a “bill of health” for the image. Based on this information, organizations can then take action to mitigate vulnerabilities.

Sample product offerings include:

Example from Aqua Security:

Container Orchestration

Kubernetes*

The Google-designed Kubernetes is an open-source system for Docker container management and orchestration. Kubernetes uses a single master server that manages multiple nodes; the cluster is operated through the command-line interface kubectl.

The basic unit of scheduling is a “pod,” a group of typically one to five containers that are deployed together on a single node in order to execute a particular task. Pods are temporary — they may be generated and deleted at will while the system is running. Higher level concepts such as Deployments can be constructed as a set of pods.

Users can set up custom health checks, including HTTP checks and container execution checks, on each pod in order to ensure that applications are operating correctly.
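For instance, an HTTP health check is declared per container in the pod spec; if the probe fails, the kubelet restarts the container (the names, path, and ports here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: mycompany/web:1.0
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:          # HTTP health check against the container
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
```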

Docker Swarm

Docker Swarm is Docker’s own tool for cluster management and orchestration and was recently introduced into Docker Engine as “swarm mode” with the Docker 1.12 update, which added support to the Docker Engine for multi-host and multi-container orchestration.

Administrators and software developers can create and manage a virtual system known as a “swarm” that is composed of one or more Docker nodes. The Docker Container deployments are typically handled via Docker Compose or the Docker command line. Docker claims that the software can handle up to 30,000 containers and clusters of up to 1,000 nodes, without suffering any dip in performance.

Marathon

Marathon is a production-grade open-source framework for container management and orchestration that is based on Apache Mesos and intended to work with applications or services that will run over a long period of time.

Marathon is a fully REST-based solution and can also be operated using a web user interface. In order to guard against failure, Marathon can run multiple schedulers at once so that the system can continue if one scheduler crashes.

Like Kubernetes, Marathon allows you to run regular health checks, so you stay up to date on the status of your applications. Another benefit of Marathon is its maturity; the software is stable and has a variety of useful features such as health checks, event subscriptions, and metrics.

AWS ECS

Amazon EC2 Container Service is a container management service for Docker containers. Importantly, any containers managed by Amazon ECS will be run only on instances of Amazon Web Services EC2; so far, there is no support for external infrastructure.

It provides access to AWS features such as elastic load balancing, which redistributes application traffic to provide better performance under pressure, and CloudTrail, a logging and monitoring application.

Tasks are the basic unit of Amazon ECS and are grouped into services by the task scheduler. Persistent data storage can be accomplished via data volumes or Amazon Elastic File System.

HashiCorp Nomad

HashiCorp’s Nomad is an open-source offering that can support Docker containers as well as VMs and standalone applications.

Nomad works on the agent model, with an agent deployed on each host which communicates with the central Nomad servers.

The Nomad servers take care of job scheduling based on which hosts have available resources. Nomad can span data centers and also integrate with other Hashicorp tools like Consul.

*Kubernetes is recommended for its portability, its more mature and tested multi-cloud and on-premises deployment support, and its large deployment base.

Service Tracing

With the adoption of microservices, problems emerge due to the sheer number of services that exist in a larger system. Problems that had to be solved once for a monolith, like security, load balancing, monitoring, and rate limiting need to be handled for each service.

Kubernetes and Services

Kubernetes supports a microservices architecture through the Service construct. It allows developers to abstract away the functionality of a set of Pods, and expose it to other developers through a well-defined API. It allows adding a name to this level of abstraction and performs rudimentary L4 load balancing. But it doesn’t help with higher-level problems, such as L7 metrics, traffic splitting, rate limiting, circuit breaking, etc.
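A minimal Service manifest illustrates the abstraction: a stable name and port in front of whatever Pods match the label selector, with L4 load balancing handled by kube-proxy (labels and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders      # any Pod carrying this label backs the Service
  ports:
  - port: 80         # stable port other services connect to
    targetPort: 8080 # the port the container actually listens on
```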

Istio addresses these problems in a fundamental way through a service mesh framework. With Istio, developers can implement the core logic for the microservices, and let the framework take care of the rest — traffic management, discovery, service identity and security, and policy enforcement. Better yet, this can be also done for existing microservices without rewriting or recompiling any of their parts. Istio uses Envoy as its runtime proxy component and provides an extensible intermediation layer which allows global cross-cutting policy enforcement and telemetry collection.

The current release of Istio is targeted to Kubernetes users and is packaged in a way that you can install in a few lines and get visibility, resiliency, security and control for your microservices in Kubernetes out of the box.

Logging, Monitoring & Metrics

Fluentd

Fluentd is an open-source data collector designed to unify logging infrastructure.

Fluentd has four key features for building reliable logging pipelines:

  • Unified Logging with JSON: it structures data as JSON as much as possible. This allows Fluentd to unify all facets of processing log data: collecting, filtering, buffering, and outputting logs across multiple sources and destinations. The downstream data processing is much easier with JSON, since it has enough structure to be accessible without forcing rigid schemas.
  • Pluggable Architecture: it has a flexible plugin system that allows the community to extend its functionality. Over 300 community-contributed plugins connect dozens of data sources to dozens of data outputs, manipulating the data as needed. By using plugins, you can make better use of your logs right away.
  • Minimum Resources Required: it is written in a combination of C and Ruby, and requires minimal system resources. The vanilla instance runs on 30–40MB of memory and can process 13,000 events/second/core.
  • Built-in Reliability: it supports memory- and file-based buffering to prevent inter-node data loss. Fluentd also supports robust failover and can be set up for high availability.

Fluentd also integrates with Elasticsearch and has a plugin for Kubernetes.
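A minimal Fluentd pipeline sketch — tail JSON container logs and forward them to Elasticsearch (the paths, tags, and host names are illustrative):

```
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag kubernetes.*
  format json
</source>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch.logging.svc
  port 9200
  logstash_format true
</match>
```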

AWS CloudWatch (integrated with ECS)

The advantage of using ECS is that it comes integrated with AWS CloudWatch. CloudWatch Logs includes the ability to create metric filters that can raise alarms when there are too many errors, and it integrates with Amazon Elasticsearch Service and Kibana to enable powerful queries and analysis. However, it is only available on AWS.

Governance

The typical workflow of container development and governance is illustrated in the diagram as follows.

Image Provenance

Image provenance requires a secure labeling system that identifies exactly and incontrovertibly where the containers running in the production environment came from.

The gold standard for image provenance is Docker Content Trust. With Docker Content Trust enabled, a digital signature is added to images before they are pushed to the registry. When the image is pulled, Docker Content Trust will verify the signature, thereby ensuring the image comes from the correct organization and the contents of the image exactly match the image that was pushed. This ensures attackers did not tamper with the image, either in transit or when it was stored at the registry. Other, more advanced, attacks — such as rollback attacks and freeze attacks — are also prevented by Docker Content Trust, through its implementation of The Update Framework (TUF).
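Content trust is switched on per client environment; with it enabled, pushes are signed and pulls are verified (the registry and repository names below are illustrative):

```shell
# sign images on push and verify signatures on pull
export DOCKER_CONTENT_TRUST=1

# the push is signed with the repository's signing keys
docker push registry.mycompany.net/payments:1.4

# the pull fails if the image's signature cannot be verified
docker pull registry.mycompany.net/payments:1.4
```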

Image Lineage Tracing

Tools that track the lineage of container images can be used to examine their composition: see Picture 7. Tools such as dockviz and ImageLayers help visualize the lineage tree of container images.

Developer Experience

Creating a self-service catalogue with parameterized, predefined templates and an automated Dev/Test environment provisioning workflow using a CI/CD toolchain can help increase developer productivity and reduce operational overhead.

Access Control

The recommended strategy is to develop a centralized policy administration point (PAP) that uses standard protocols to propagate policy changes and updates to policy enforcement points (PEPs), whether at the service management layer, which manages and enforces endpoint security, or at the container registry.

Audit Trail

Establish centralized logging for containers and use tools such as Elasticsearch and Kibana to preserve the audit trail and assist in compliance and security analysis.
