Cloudflare Prometheus Exporter

In this post, I walk you through setting up a Cloudflare Prometheus Exporter and configuring the necessary scrape config, an example dashboard, and some alerting rules. These examples demonstrate how to deploy the exporter on a Kubernetes cluster with Prometheus managed by the Prometheus Operator. However, this is no hard requirement, and you can adapt the examples to fit your infrastructure setup.

A word of caution: the APIs used by the exporter are only available to Cloudflare Enterprise customers, so make sure you double-check that before you put in the effort of setting things up.

Deploying the Cloudflare Exporter

First of all, big thanks to Wehkamp for open sourcing their Cloudflare exporter for others to enjoy 🎉. You can check out the exporter repository on GitHub (wehkamp/docker-prometheus-cloudflare-exporter) for more information on which metrics the exporter exposes.

Below you see a standard Kubernetes Deployment manifest:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflare-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudflare-exporter
      cloudflare/zone: your-company-com
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloudflare-exporter
        cloudflare/zone: your-company-com
    spec:
      containers:
        - name: cloudflare-exporter
          image: "wehkamp/prometheus-cloudflare-exporter:1.1.1"
          imagePullPolicy: IfNotPresent
          env:
            - name: ZONE
              value: "your-company.com"
            - name: AUTH_EMAIL
              value: "[email protected]"
            - name: AUTH_KEY
              valueFrom:
                secretKeyRef:
                  name: cloudflare-api-key-secret
                  key: api-key
          ports:
            - name: metrics
              containerPort: 9199
              protocol: TCP
```

A couple of things I want to point out:

  1. Check Docker Hub to make sure you use the latest released image. The instructions in their README point to version 1.0; however, 1.1.1 is the latest available tag.
  2. As you can see, we specify the ZONE environment variable. Unfortunately, the exporter can only handle one specific zone at a time, so if your company owns multiple zones, you will need to deploy a dedicated exporter per zone.
  3. Authentication! The exporter still uses Cloudflare’s Global API key to authenticate and not their API Tokens, which means you can’t limit the API key’s scope. However, since Cloudflare doesn’t support any notion of organization-wide API keys/tokens, I recommend configuring a “robot” account in Cloudflare with restricted access to the specific zone being monitored, limiting the scope of that account to achieve the same goal. The added benefit is that you don’t need to worry about offboarding ex-colleagues breaking your monitoring in the meanwhile.

PS: make sure you have configured a Kubernetes Secret with the name cloudflare-api-key-secret and a key api-key containing the Global API key.
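
A minimal sketch of what that Secret could look like; the api-key value below is a placeholder you must replace with your actual Global API key:

```yaml
---
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-key-secret
  namespace: monitoring
type: Opaque
stringData:
  # Placeholder value; substitute your real Cloudflare Global API key.
  api-key: "your-global-api-key"
```

Using stringData lets you write the key in plain text and have Kubernetes base64-encode it for you on creation.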

Set up the scrape config a.k.a. Service Monitors

Now that the exporter is running, it’s time to scrape it to get the metrics into Prometheus. To configure the scrape config, we are going to define a ServiceMonitor CRD resource. This CRD comes installed with the Prometheus Operator and gives you the ability to configure scrape configs using Kubernetes resources. The Prometheus Operator automatically gathers these ServiceMonitor objects and updates the Prometheus configuration accordingly. Let’s see what that looks like:

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cloudflare-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: cloudflare-exporter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudflare-exporter
  namespaceSelector:
    matchNames:
      - monitoring
  endpoints:
    - port: metrics
      interval: 30s
```

It looks like a normal Kubernetes manifest; however, this won’t work just yet. We still need to define an actual Service for it to monitor:

```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: cloudflare-exporter
  namespace: monitoring
  labels:
    app.kubernetes.io/name: cloudflare-exporter
spec:
  type: ClusterIP
  ports:
    - port: 9199
      targetPort: metrics
      protocol: TCP
      name: metrics
  selector:
    app.kubernetes.io/name: cloudflare-exporter
```

This is a regular Kubernetes Service manifest. There is one trick I want to point out: the careful reader might have noticed I only configured the app.kubernetes.io/name: cloudflare-exporter label as the selector, but didn’t add the cloudflare/zone label. This little “trick” makes sure that when you add more exporter deployments to monitor different zones, they will automatically be picked up by the ServiceMonitor without any changes.
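
To illustrate, a sketch of a second exporter Deployment for a hypothetical second zone "another-company.com"; only the name, the zone label, and the ZONE value differ, so the Service selector above matches it without changes:

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudflare-exporter-another-company-com
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudflare-exporter
      # Different zone label, but the Service only selects on the name label.
      cloudflare/zone: another-company-com
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloudflare-exporter
        cloudflare/zone: another-company-com
    spec:
      containers:
        - name: cloudflare-exporter
          image: "wehkamp/prometheus-cloudflare-exporter:1.1.1"
          env:
            - name: ZONE
              value: "another-company.com"
            # AUTH_EMAIL and AUTH_KEY as in the original Deployment.
          ports:
            - name: metrics
              containerPort: 9199
              protocol: TCP
```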

The exported metrics already include the zone they belong to as a label, so you don’t need any custom relabel configuration to find out which zone the metrics come from.
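
For example, to compare request rates across all monitored zones, you can group by that label directly (a sketch using the same metric as the alerting rules below):

```promql
sum by (zone) (rate(cloudflare_pop_http_responses_sent[5m]))
```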

Configure a Grafana dashboard

The Cloudflare Prometheus Exporter already comes with an example dashboard. However, it’s a plain JSON model, which makes it a rather long blob of text. As a shameless plug, I want to share one of my previous posts where I explain how to use Grafonnet to generate dashboards instead. In this gist you can find the example Cloudflare Grafonnet dashboard we use today; this dashboard borrows a few concepts from GitLab’s runbook repository. We’ve abstracted away a couple of standardized components and helpers, so instead of a ~500-line JSON blob, you get a ~90-line Grafonnet definition, which is much easier to comprehend.

Alerting rules

As with the scrape config, the Prometheus Operator additionally offers a PrometheusRule CRD resource to configure alerting rules (and recording rules). Below we discuss two examples of alerting rules and their purpose.

  1. The missing metric alert

    ```yaml
    - alert: CloudflareExporterScrapeMissing
      expr: absent(sum by(zone) (cloudflare_pop_http_responses_sent))
    ```
    This alert makes sure we get notified when Cloudflare metrics go missing entirely. It’s always crucial to also be alerted on absent metrics; otherwise a broken exporter would silently disable every alert that depends on them.

  2. The elevated error rate alert

    ```yaml
    - alert: CloudflareElevatedErrorRate
      expr: |
        (
          sum by (zone) (
            rate(cloudflare_pop_http_responses_sent{http_status=~"5.."}[5m])
          )
          /
          on (zone) sum by (zone) (
            rate(cloudflare_pop_http_responses_sent[5m])
          )
        ) > 0.05
      for: 1m
    ```

    The elevated error rate alert is why we went to all this trouble of setting up the exporter in the first place. It takes the 5xx HTTP responses from the previous 5 minutes and divides them by the total number of HTTP responses from the previous 5 minutes to calculate the percentage of requests that are erroneous. If that percentage crosses 5%, we trigger a notification.

Finally, you wrap these alerts into a PrometheusRule manifest, like this:

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cloudflare-alerts.rules
  namespace: monitoring
spec:
  groups:
    - name: cloudflare.rules
      rules:
        - alert: CloudflareExporterScrapeMissing
          expr: absent(sum by(zone) (cloudflare_pop_http_responses_sent))
          for: 1m
          labels:
            severity: s4
            alert_type: cause
            pager: pagerduty
            environment: production
          annotations:
            summary: Scrape errors in Cloudflare exporter
            description: >
              The cloudflare exporter has failed to scrape Cloudflare. Note that
              this refers to a background scrape loop in the exporter itself, and
              that prometheus may be successfully scraping the exporter.
            grafana_dashboard_id: "xxx/cloudflare-pop-statistics"
        - alert: CloudflareElevatedErrorRate
          expr: |
            (
              sum by (zone) (
                rate(cloudflare_pop_http_responses_sent{http_status=~"5.."}[5m])
              )
              /
              on (zone) sum by (zone) (
                rate(cloudflare_pop_http_responses_sent[5m])
              )
            ) > 0.05
          for: 1m
          labels:
            severity: s1
            alert_type: cause
            pager: pagerduty
            environment: production
          annotations:
            summary: "Elevated Cloudflare error rate for the `{{ $labels.zone }}` zone"
            description: >
              `{{ $labels.zone }}` is receiving an elevated error rate above 5% of all traffic for the last 5 minutes
            grafana_dashboard_id: "xxxx/cloudflare-pop-statistics"
```

As you can see, we added a couple of labels and annotations to the alert rules themselves, which we use for several purposes. The annotations help us get a better picture of the specific alert and direct us to the relevant Grafana dashboard we need to investigate to resolve the situation. The labels are used mostly to route the alerts to the proper places according to their severity. We might, for example, decide to only send a notification to Slack, while in other cases we need to page someone through PagerDuty. We provide a specific PagerDuty service label to know which service to assign the alert to, ultimately notifying the right people on-call.

If any of this was helpful or if you have any questions, feel free to reach out on Twitter ✌️.