Indestructible applications with progressive delivery
Progressive delivery allows us to put a new application version in production while keeping the previous one running and controlling how much traffic each gets.
Strategies
Replicas Rollout
The replicas of the new version are progressively increased while the old version is scaled down to zero.
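Kubernetes supports this natively with the RollingUpdate Deployment strategy; a minimal sketch (all names and values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra Pod with the new version at a time
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: image:v2  # bumping the tag triggers the rollout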
Traffic Split
Through some network mechanism (load balancer, Ingress, mesh, network interface), traffic is progressively shifted to the new version of the application.
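With Nginx Ingress, for instance, this can be done through canary annotations on a second Ingress; a sketch, assuming a Service named app-v2 runs the new version:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"       # mark this Ingress as a canary
    nginx.ingress.kubernetes.io/canary-weight: "10"  # route ~10% of traffic here
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-v2   # assumed Service for the new version
                port:
                  number: 80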
Feature Flags
This is not a deployment strategy, but a release strategy.
The application itself “decides” when to expose the new functionality based on a feature flag. It has the flexibility of rolling out randomly or by client, type of client, user, location, etc. This allows us to separate deployment from release.
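Purely as an illustration (this is not the schema of any particular feature-flag service), a flag definition could look like:

# Hypothetical feature-flag configuration; real flag services
# (LaunchDarkly, Unleash, Flagsmith, ...) each have their own schema.
flags:
  new-checkout-flow:
    enabled: true
    rollout:
      percentage: 10              # expose to a random 10% of users
      users: ["beta-tester-42"]   # always expose to these users
      locations: ["NL", "DE"]     # or to these locations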
Each of these approaches has its merits and drawbacks. It’s possible to use Replicas Rollout or Traffic Split during deployment and Feature Flags to release the functionality to the users.
Why should we deliver progressively?
Through progressive delivery, we can reduce both the risk and the impact of a new (bad) deployment. It gives us the chance and the time to analyse the behaviour of our application in the production environment with real traffic. Since the old version of the application is still running, the rollback can be much faster, reducing our MTTR.
Flagger: progressive delivery in Kubernetes
Part of the FluxCD ecosystem, Flagger can do progressive delivery for us. It supports multiple mechanisms, but today we'll be using Nginx Ingress.
Canary
The new version of the application that slowly starts getting traffic is called the canary (after the canaries used as sentinels in coal mines).
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: default
spec:
  provider: nginx
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  [...]
A Canary references a Deployment and manages it for us.
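With the nginx provider, the Canary also references the Ingress that Flagger will manipulate to shift traffic; a minimal sketch of those fields, assuming an Ingress named app:

spec:
  provider: nginx
  ingressRef:                        # the Ingress Flagger manipulates to shift traffic
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: app                        # assumes our Ingress is named "app"
  service:
    port: 80                         # port exposed by the Services Flagger generates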
Quirks
There are a few things that might be counterintuitive when using Flagger.
We define the Canary Deployment.
The Deployment we define in our manifests is not the primary one; it's the canary Deployment. Flagger will create a “shadow” Deployment named <deployment_name>-primary, and the Pods from that “shadow” Deployment normally handle production traffic. We should not modify this Deployment: it's managed by Flagger.
The Deployment we define is normally scaled to 0 unless a deployment of a new version is happening, in which case its Pods will run the new application version.
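To illustrate the steady state between releases (assuming our Deployment is named app):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app            # the Deployment we define: the canary
spec:
  replicas: 0          # idle while no release is in progress
  [...]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-primary    # the "shadow" Deployment created by Flagger
spec:
  replicas: 2          # handles all production traffic
  [...]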
Canaries get discarded.
Once the analysis is successful and the canary is promoted, the canary Pods don't become the primary. While the canary instances are still running, the primary Deployment gets updated with the same version as the canary and re-deployed. Once the new version is up and healthy on the primary Deployment, the canary Deployment gets scaled down to 0.
Sequence of events
When a new version is deployed, the following happens:
- Deployment version change: it can be done manually or by any external tool (FluxCD, ArgoCD, etc.):
  name: app
  image: image:v1 -> image:v2
- Flagger increases the number of replicas of the canary deployment (the primary deployment stays unchanged):
  name: app
  replicas: 0 -> 1
- Flagger starts sending some traffic to the canary service.
- Flagger validates the health of the new Pods. If the new version fails any validation, it gets scaled down to 0 and the release is aborted.
- Flagger increases the traffic to the canary service and keeps running the analysis.
- Flagger copies the canary configuration to the primary:
  name: app-primary
  image: image:v1 -> image:v2
- Flagger scales the canary deployment down:
  name: app
  replicas: 1 -> 0
Analysis
How does Flagger know if our canary is ok?
The same way we (should) know: metrics!
Metrics are part of the Canary analysis definition.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
  namespace: default
spec:
  [...]
  analysis:
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99       # success rate must stay at or above 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 100      # max request duration, in milliseconds
        interval: 1m
Flagger ships out of the box with request-success-rate and request-duration queries that it can run against Prometheus. We can also add our own queries.
Analysis failure
By default, Flagger will retry running the analysis, since it can fail for many reasons; e.g. the metric might not be available right away when the canary deployment starts.
If the analysis keeps failing, the release is aborted. Since the primary deployment has not been modified, Flagger only needs to scale the canary deployment down to 0 replicas.
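How often the analysis runs and how many failures are tolerated are configurable on the Canary; a sketch of the relevant analysis fields (values are illustrative):

spec:
  analysis:
    interval: 1m     # time between metric checks
    threshold: 5     # failed checks tolerated before the release is aborted
    maxWeight: 50    # maximum traffic percentage routed to the canary
    stepWeight: 10   # traffic percentage increase per successful check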
Metric Template
We can define our own metrics with a MetricTemplate and reference them from our canary.
For example: the default request-success-rate considers success as anything that is not a 5xx error, but we might want to count only 2xx and 3xx responses. In this scenario, we are using Prometheus metrics.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: strict-request-success-rate
  namespace: default
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.ingress-nginx:9090
  query: |
    sum(
      rate(
        nginx_ingress_controller_requests{
          namespace="{{ namespace }}",
          ingress="{{ ingress }}",
          canary!="",
          status=~"[23].*"
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        nginx_ingress_controller_requests{
          namespace="{{ namespace }}",
          ingress="{{ ingress }}",
          canary!=""
        }[{{ interval }}]
      )
    )
    * 100
Flagger fills in the {{ namespace }}, {{ ingress }} and {{ interval }} placeholders when it runs the query.
We can now reference this template in our Canaries using templateRef:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: golang-api
  namespace: default
spec:
  [...]
  analysis:
    metrics:
      - name: request-success-rate
        templateRef:
          name: strict-request-success-rate
          namespace: default
        thresholdRange:
          min: 99
        interval: 1m
Ready to Experiment?
Head over to this repo and try Flagger in a local cluster. You’ll find all the instructions in the Readme.
Try deploying a slow or faulty application version and watch how Flagger stops you from completely breaking “production”.