soft realtime ticker on Linux

This post was written in either 2018 or 2019.

Recently, I’ve been having an issue where my game (written in Go), which uses a time.Ticker to tick once every 16ms, does not actually tick once every 16ms. In fact, sometimes a tick can take as long as 39ms. Even after disabling the Go GC, I was still seeing maximum tick intervals of 24ms.

To narrow down the problem, I wrote simplified Go programs that exercise only time.Ticker itself. This program, for example, runs a single ticker on the main goroutine while a few hundred mostly-idle goroutines run in the background:

package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

var newrb = make(chan bool, 10)

// getter blocks forever on a channel that nothing ever writes to.
func getter() {
	for {
		<-newrb
	}
}

// burnCPU loops forever, sleeping 1ms per iteration. Despite the name it is
// mostly asleep; its real job is just to give the scheduler more goroutines
// to juggle.
func burnCPU() {
	done := make(chan bool)
	for {
		select {
		case <-done:
			return
		default:
			time.Sleep(1 * time.Millisecond)
		}
	}
}

var start time.Time

// critical runs the 16ms ticker and panics if any tick interval exceeds 19ms.
func critical() {
	debug.SetGCPercent(-1)
	tickIntervalLimit := time.Duration(19 * time.Millisecond)
	ticker := time.NewTicker(time.Duration(16 * time.Millisecond))

	t0 := time.Now()
	maxSeen := time.Duration(0)
	for range ticker.C {
		dur := time.Since(t0)
		if dur > maxSeen {
			maxSeen = dur
			fmt.Println("new record: ", maxSeen)
		}
		if dur > tickIntervalLimit {
			fmt.Println("time: ", time.Since(start))
			panic("time limit exceed, space shuttle exploded")
		}
		t0 = time.Now()
	}
}

func main() {
	start = time.Now()

	for i := 0; i < 100; i++ {
		go getter()
	}

	for i := 0; i < 300; i++ {
		go burnCPU()
	}
	critical()
}

Output:

new record:  16.124647ms
new record:  16.500283ms
new record:  17.155078ms
new record:  17.844003ms
new record:  19.591948ms
time:  12.549167497s
panic: time limit exceed, space shuttle exploded

Okay, so maybe this is a problem with time.Ticker. Let’s write the simplest possible code to replicate the same functionality but without using time.Ticker:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

var start time.Time

func main() {
	start = time.Now()

	tickIntervalLimit := time.Duration(19 * time.Millisecond)
	tickInterval := time.Duration(16 * time.Millisecond)

	tlast := time.Now()
	maxSeen := time.Duration(0)
	for {
		dur := time.Since(tlast)
		tlast = time.Now()

		if dur > maxSeen {
			maxSeen = dur
			fmt.Println("new record: ", maxSeen)
		}
		if dur > tickIntervalLimit {
			fmt.Println("time: ", time.Since(start))
			panic("time limit exceed, space shuttle exploded")
		}

		// do some work
		r := rand.Intn(10)
		time.Sleep(time.Duration(r) * time.Microsecond)
		elapsed := time.Since(tlast)

		// sleep the appropriate amount of time
		time.Sleep(tickInterval - elapsed)

	}
}

Gave me the following results after running for a while:

new record:  330ns
new record:  16.092566ms
new record:  16.097095ms
new record:  16.119106ms
new record:  16.186136ms
new record:  16.224875ms
new record:  16.242395ms
new record:  16.278755ms
new record:  16.336664ms
new record:  16.339054ms
new record:  16.359964ms
new record:  16.466703ms
new record:  16.477522ms
new record:  16.531921ms
new record:  16.61025ms
new record:  16.89469ms
new record:  16.896275ms
new record:  17.653027ms
new record:  17.822675ms

If you add extra goroutines to this program (goroutines that either wait on a channel or loop sleeping for 1 millisecond at a time), it explodes almost immediately:

package main

import (
	"fmt"
	"math/rand"
	"runtime/debug"
	"time"
)

var channel = make(chan bool, 10)

func getter() {
	for {
		<-channel
	}
}

func burnCPU() {
	done := make(chan bool)
	for {
		select {
		case <-done:
			return
		default:
			time.Sleep(1 * time.Millisecond)
		}
	}
}

var start time.Time

func main() {
	debug.SetGCPercent(-1)
	start = time.Now()

	for i := 0; i < 50; i++ {
		go getter()
	}

	for i := 0; i < 200; i++ {
		go burnCPU()
	}

	tickIntervalLimit := time.Duration(19 * time.Millisecond)
	tickInterval := time.Duration(16 * time.Millisecond)

	tlast := time.Now()
	maxSeen := time.Duration(0)
	for {
		dur := time.Since(tlast)
		tlast = time.Now()

		if dur > maxSeen {
			maxSeen = dur
			fmt.Println("new record: ", maxSeen)
		}
		if dur > tickIntervalLimit {
			fmt.Println("time: ", time.Since(start))
			panic("time limit exceed, space shuttle exploded")
		}

		// do some work
		r := rand.Intn(10)
		time.Sleep(time.Duration(r) * time.Microsecond)
		elapsed := time.Since(tlast)

		// sleep the appropriate amount of time
		time.Sleep(tickInterval - elapsed)

	}
}

Output:

new record:  260ns
new record:  16.085476ms
new record:  16.086106ms
new record:  16.089686ms
new record:  16.109126ms
new record:  16.114356ms
new record:  16.128786ms
new record:  16.131925ms
new record:  16.136616ms
new record:  16.195085ms
new record:  16.212755ms
new record:  17.569533ms
new record:  19.1844ms
time:  5.620648333s
panic: time limit exceed, space shuttle exploded

For comparison’s sake, let’s try a busy-wait version:

package main

import (
	"fmt"
	"math/rand"
	"runtime/debug"
	"time"
)

var channel = make(chan bool, 10)

func getter() {
	for {
		<-channel
	}
}

func burnCPU() {
	done := make(chan bool)
	for {
		select {
		case <-done:
			return
		default:
			time.Sleep(1 * time.Millisecond)
		}
	}
}

var start time.Time

func main() {
	debug.SetGCPercent(-1)
	start = time.Now()

	for i := 0; i < 50; i++ {
		go getter()
	}

	for i := 0; i < 200; i++ {
		go burnCPU()
	}

	tickIntervalLimit := time.Duration(19 * time.Millisecond)
	tickInterval := time.Duration(16 * time.Millisecond)

	tlast := time.Now()
	maxSeen := time.Duration(0)
	for {
		dur := time.Since(tlast)
		tlast = time.Now()

		if dur > maxSeen {
			maxSeen = dur
			fmt.Println("new record: ", maxSeen)
		}
		if dur > tickIntervalLimit {
			fmt.Println("time: ", time.Since(start))
			panic("time limit exceed, space shuttle exploded")
		}

		// do some work
		r := rand.Intn(10)
		time.Sleep(time.Duration(r) * time.Microsecond)

		// busy wait
		for time.Since(tlast) < tickInterval {
		}

	}
}

Output:

new record:  280ns
new record:  16.000127ms
new record:  16.000207ms
new record:  16.000226ms
new record:  16.000356ms
new record:  16.000447ms
new record:  18.20143ms
new record:  18.280249ms
new record:  21.536809ms
time:  2m41.030212583s

Another explosion (okay, this one took over 2 minutes to go - sometimes it explodes faster than others - but you usually see 18ms pretty quickly). So even busy-waiting doesn’t work. Since we already disabled Go’s garbage collector, this is clearly an issue with the Go scheduler, right? So if it’s a problem with Go, let’s switch to C++ and see if it still happens. Let’s start with nanosleep (here via std::this_thread::sleep_for, which comes down to nanosleep on Linux):

#include <stdint.h>
#include <chrono>
#include <cstdlib>   // std::abs on int64_t
#include <iostream>
#include <iomanip>
#include <random>
#include <thread>

int64_t now_in_ns()
  {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();
  }

int main(){

    int64_t trials              = 100000;
    double   overall_total_clock = 0.0;

    std::cout << "\n\n16ms Loops:" << std::endl;
    std::cout << "        ns      err(ns)" << std::endl;

    int64_t biggestSeenDiff = 0;
    int64_t tlast = now_in_ns();
    std::this_thread::sleep_for(std::chrono::nanoseconds(16000000));
    while(true)
      {
          int64_t dur = now_in_ns() - tlast;
          tlast = now_in_ns();

          int64_t diff      = 16000000 - dur;
          if (std::abs(diff) > std::abs(biggestSeenDiff)){
              biggestSeenDiff = diff;
              std::cout << "new record: " << diff/1000 << "us" << std::endl;
          }

          int64_t elapsed = now_in_ns() - tlast;
          std::this_thread::sleep_for(std::chrono::nanoseconds(16000000 - elapsed));
      }
}

Output after running for a couple of minutes:

16ms Loops:
        ns      err(ns)
new record: -86us
new record: -86us
new record: -89us
new record: -93us
new record: -94us
new record: -160us
new record: -176us
new record: -190us
new record: -260us
new record: -314us
new record: -323us
new record: -372us
new record: -386us
new record: -701us

Okay, not as bad as Go (at least we’re not seeing >1 millisecond delays), but still pretty bad. And we haven’t even added any additional concurrency yet. Let’s add some and see what happens. To avoid the overhead of OS thread context switches, we will use boost::fiber (to compile this program you will need to add linker flags such as -lpthread -lboost_fiber -lboost_context):

#include <stdint.h>
#include <chrono>
#include <cstdlib>   // std::abs on int64_t
#include <iostream>
#include <iomanip>
#include <random>
#include <thread>

#include <boost/assert.hpp>
#include <boost/fiber/all.hpp>
#include <boost/fiber/detail/thread_barrier.hpp>
#include <boost/fiber/algo/round_robin.hpp>

int64_t now_in_ns()
  {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();
  }

void f() {
    while (true) {
        boost::this_fiber::sleep_for(std::chrono::milliseconds(1));
        boost::this_fiber::yield();
    }
}

int main(){

    double   overall_total_clock = 0.0;

    std::cout << "\n\n16ms Loops:" << std::endl;
    std::cout << "        ns      err(ns)" << std::endl;

    int64_t biggestSeenDiff = 0;

    std::vector<boost::fibers::fiber*> v;
    for (int i = 0; i < 200; i++){
        boost::fibers::fiber t1(f);
        t1.detach();
    }

    int64_t tlast = now_in_ns();
    boost::this_fiber::sleep_for(std::chrono::nanoseconds(16000000));
    while(true)
      {
          int64_t dur = now_in_ns() - tlast;
          tlast = now_in_ns();

          int64_t diff      = 16000000 - dur;
          if (std::abs(diff) > std::abs(biggestSeenDiff)){
              biggestSeenDiff = diff;
              std::cout << "new record: " << diff/1000 << "us" << std::endl;
          }

          int64_t elapsed = now_in_ns() - tlast;
          boost::this_fiber::sleep_for(std::chrono::nanoseconds(16000000 - elapsed));
      }
}

Output:

16ms Loops:
        ns      err(ns)
new record: -60us
new record: -61us
new record: -149us
new record: -165us
new record: -497us
new record: -1169us
new record: -2108us
new record: -2192us
new record: -2727us
new record: -3893us
new record: -3907us
new record: -4010us
new record: -4080us
new record: -5419us

At this point, I’m pretty sure these delays are being caused by the Linux system scheduler running other threads instead of resuming this thread instantly as soon as the nanosleep finishes.

So I googled and I found this answer on Stack Overflow:

Typically, a game will keep drawing new frames all the time even when the user does not do anything. This would normally happen where you have your onIdle() call. If your game only updates the window/screen when the user presses a button or such, or sporadically in between, then MSGWaitForMultipleObjects is a good option.

However, in a continuous-animation game, you would normally not want to block or sleep in the render thread at all if you can help it, instead you want to render at maximum speed and rely on vertical sync as a throttle. The reason for that is that timing, blocking, and sleeping is unprecise at best, and unreliable at worst, and it will quite possibly add disturbing artefacts to your animations.

What you normally want to do is push everything belonging to a frame to the graphics API (most likely OpenGL since you said “any platform”) as fast as you can, signal a worker thread to start doing the game logic and physics etc, and then block on SwapBuffers.

All timers, timeouts, and sleep are limited by the scheduler’s resolution, which is 15ms under Windows (can be set to 1ms using timeBeginPeriod). At 60fps, a frame is 16.666ms, so blocking for 15ms is catastrophic, but even 1ms is still a considerable time. There is nothing you can do to get a better resolution (this is considerably better under Linux).

Sleep, or any blocking function that has a timeout, guarantees that your process sleeps for at least as long as you asked for (in fact, on Posix systems, it may sleep less if an interrupt occurred). It does not give you a guarantee that your thread will run as soon as the time is up, or any other guarantees.

Sleep(0) under Windows is even worse. The documentation says “the thread will relinquish the remainder of its time slice but remain ready. Note that a ready thread is not guaranteed to run immediately”. Reality has it that it works kind of ok most of the time, blocking anywhere from “not at all” to 5-10ms, but on occasions I’ve seen Sleep(0) block for 100ms too, which is a desaster when it happens.

Using QueryPerformanceCounter is asking for trouble – just don’t do it. On some systems (Windows 7 with a recent CPU) it will work just fine because it uses a reliable HPET, but on many other systems you will see all kinds of strange “time travel” or “reverse time” effects which nobody can explain or debug. That is because on those systems, the result of QPC reads the TSC counter which depends on the CPU’s frequency. CPU frequency is not constant, and TSC values on multicore/multiprocessor systems need not be consistent.

Then there is the synchronization issue. Two separate timers, even when ticking at the same frequency, will necessarily become unsynchronized (because there is no such thing as “same”). There is a timer in your graphics card doing the vertical sync already. Using this one for synchronisation will mean everything will work, using a different one will mean things will eventually be out of sync. Being out of sync may have no effect at all, may draw a half finished frame, or may block for one full frame, or whatever.

While missing a frame is usually not that much of a big deal (if it’s just one and happens rarely), in this case it is something you can totally avoid in the first place.

Okay, so maybe nanosleep is the cause of the problem. Let’s try busy waiting instead.

#include <stdint.h>
#include <chrono>
#include <cstdlib>   // std::abs on int64_t
#include <iostream>
#include <iomanip>
#include <random>
#include <thread>

#include <sched.h>

#include <boost/assert.hpp>
#include <boost/fiber/all.hpp>
#include <boost/fiber/detail/thread_barrier.hpp>
#include <boost/fiber/algo/round_robin.hpp>


int64_t now_in_ns()
  {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();
  }

void f() {
    while (true) {
        boost::this_fiber::sleep_for(std::chrono::milliseconds(1));
        boost::this_fiber::yield();
    }
}

int main(){
    std::cout << "\n\n16ms Loops:" << std::endl;
    std::cout << "        ns      err(ns)" << std::endl;

    int64_t biggestSeenDiff = 0;

    std::vector<boost::fibers::fiber*> v;
    for (int i = 0; i < 200; i++){
        boost::fibers::fiber t1(f);
        t1.detach();
    }

    int64_t tlast = now_in_ns();
    boost::this_fiber::sleep_for(std::chrono::nanoseconds(16000000));
    while(true)
      {
          int64_t dur = now_in_ns() - tlast;
          tlast = now_in_ns();

          int64_t diff      = 16000000 - dur;
          if (std::abs(diff) > std::abs(biggestSeenDiff)){
              biggestSeenDiff = diff;
              std::cout << "new record: " << diff/1000 << "us" << std::endl;
          }

          while (now_in_ns() - tlast < 16000000){
              //do nothing. busy wait.
          }

      }
}

Output:

16ms Loops:
        ns      err(ns)
new record: -55us
new record: -58us
new record: -168us
new record: -317us
new record: -2829us

Nuh-uh. It seems you can still get multi-millisecond pauses even if you busy-wait. Looking around some more, I found this Dr. Dobb’s article, which says we have to set the scheduling policy for our process to round-robin and the priority to 99 (top shows this priority as rt). So let’s try nanosleep again, this time with rt priority and SCHED_RR:

#include <stdint.h>
#include <chrono>
#include <cstdlib>   // std::abs on int64_t
#include <iostream>
#include <iomanip>
#include <random>
#include <thread>

#include <sched.h>

#include <boost/assert.hpp>
#include <boost/fiber/all.hpp>
#include <boost/fiber/detail/thread_barrier.hpp>
#include <boost/fiber/algo/round_robin.hpp>

int64_t now_in_ns()
  {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();
  }

void f() {
    while (true) {
        boost::this_fiber::sleep_for(std::chrono::milliseconds(1));
        boost::this_fiber::yield();
    }
}

int main(){
    struct sched_param sp = {
        .sched_priority = 99
    };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0){
        std::cout << "failed to set scheduling policy. need root (use sudo).";
        exit(1);
    }

    std::cout << "\n\n16ms Loops:" << std::endl;
    std::cout << "        ns      err(ns)" << std::endl;

    int64_t biggestSeenDiff = 0;

    std::vector<boost::fibers::fiber*> v;
    for (int i = 0; i < 200; i++){
        boost::fibers::fiber t1(f);
        t1.detach();
    }

    int64_t tlast = now_in_ns();
    boost::this_fiber::sleep_for(std::chrono::nanoseconds(16000000));
    while(true)
      {
          int64_t dur = now_in_ns() - tlast;
          tlast = now_in_ns();

          int64_t diff      = 16000000 - dur;
          if (std::abs(diff) > std::abs(biggestSeenDiff)){
              biggestSeenDiff = diff;
              std::cout << "new record: " << diff/1000 << "us" << std::endl;
          }

          int64_t elapsed = now_in_ns() - tlast;
          boost::this_fiber::sleep_for(std::chrono::nanoseconds(16000000 - elapsed));
      }
}

Output:

16ms Loops:
        ns      err(ns)
new record: -58us
new record: -58us
new record: -81us
new record: -89us
new record: -115us
new record: -124us
new record: -158us
new record: -162us
new record: -168us
new record: -169us
new record: -222us
new record: -254us
new record: -372us
new record: -808us

Still not the microsecond precision I wanted. What’s going on? Well, the Dr. Dobb’s article explains it:

Of the several methods of setting a timer in Linux, the easiest to use (at least of the timers that you’re likely to care to use in any sort of real-time app) is the system call nanosleep(). With nanosleep(), you can put your application to sleep for a specified amount of time, and the kernel will wake it up at some point not before (and usually very soon after) the time you give. Don’t let the name fool you, though. Although nanosleep() takes a timespec struct that allows you to give an accuracy with nanosecond precision, nanosleep doesn’t really give anywhere near that level of accuracy. Under Linux, the kernel only checks on timers once every “jiffie,” which defaults to 1 ms under the 2.6 kernel (on older kernels it was 10 ms, and 10 ms is still a compile option in the 2.6 kernel). Once upon a time, nanosleep would busy-wait if the sleep time given was less than 2 ms, but not anymore. On the other hand, don’t be too discouraged—as long as you don’t need submillisecond accuracy, nanosleep() is very good, as you can see in Table 1. (test.c, the code that generated the results in Table 1, can be found online at http:// www.cuj.com/code/.) There is an overhead of about 2 ms for each call, but it’s consistent enough that you can compensate for it if you need to (usually by undersleeping for 2 ms, possibly with a busy-wait to pad the small amount of extra time if you need to ensure that you don’t finish early).

Oh, so nanosleep only has millisecond precision. Nice naming there.

Okay, so let’s take William Nagel’s advice and nanosleep for 14 milliseconds, then busy-wait the remaining 2 milliseconds (if you try to busy-wait the entire 16 milliseconds at rt priority, your system will freeze - I tried it, presumably because the rt task starves everything else - so don’t do that):

#include <stdint.h>
#include <chrono>
#include <cstdlib>   // std::abs on int64_t
#include <iostream>
#include <iomanip>
#include <random>
#include <thread>

#include <sched.h>

#include <boost/assert.hpp>
#include <boost/fiber/all.hpp>
#include <boost/fiber/detail/thread_barrier.hpp>
#include <boost/fiber/algo/round_robin.hpp>

int64_t now_in_ns()
  {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::system_clock::now().time_since_epoch()).count();
  }

void f() {
    while (true) {
        boost::this_fiber::sleep_for(std::chrono::milliseconds(1));
        boost::this_fiber::yield();
    }
}

int main(){
    struct sched_param sp = {
        .sched_priority = 99
    };
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0){
        std::cout << "failed to set scheduling policy";
        exit(1);
    }

    std::cout << "\n\n16ms Loops:" << std::endl;
    std::cout << "        ns      err(ns)" << std::endl;

    int64_t biggestSeenDiff = 0;

    std::vector<boost::fibers::fiber*> v;
    for (int i = 0; i < 1000; i++){
        boost::fibers::fiber t1(f);
        t1.detach();
    }

    int64_t tlast = now_in_ns();
    boost::this_fiber::sleep_for(std::chrono::nanoseconds(13000000));
    while (now_in_ns() - tlast < 16000000){
        //do nothing. busy wait.
    }
    while(true)
      {
          int64_t dur = now_in_ns() - tlast;
          tlast = now_in_ns();

          int64_t diff      = 16000000 - dur;
          if (std::abs(diff) > std::abs(biggestSeenDiff)){
              biggestSeenDiff = diff;
              std::cout << "new record: " << diff/1000 << "us" << std::endl;
          }

          int64_t elapsed = now_in_ns() - tlast;
          boost::this_fiber::sleep_for(std::chrono::nanoseconds(14000000 - elapsed));

          while (now_in_ns() - tlast < 16000000){
              //do nothing. busy wait.
          }

      }
}

Output (over a period of 3 hours with other processes running):

16ms Loops:
        ns      err(ns)
new record: 0us
new record: 0us
new record: 0us
new record: 0us
new record: 0us
new record: -3us
new record: -4us
new record: -5us
new record: -9us
new record: -22us
new record: -66us
new record: -259us

The results look much better than before, but still not the microsecond-level real-time I wanted (although, looking at osadl latency plots, it seems most modern CPUs, including the Intel i7-3615QE, have worst-case latencies of 30-50 microseconds, so I am probably asking for too much). According to this Stack Overflow answer, in order to get “true” realtime I would have to patch the Linux kernel with the PREEMPT_RT (RT-Preempt) patch and build with it enabled. Also, I had to wait over 2 hours before I saw the -259us, so lengthy pauses like that should be pretty rare normally (for some definition of “normally” - obviously more testing is required).
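
Before going down that road, it is worth knowing whether the kernel you are already running is RT-patched. As far as I can tell (this is my own aside, not something from the Stack Overflow answer), PREEMPT_RT kernels expose a /sys/kernel/realtime file containing 1, so a quick heuristic check looks something like this:

#include <fstream>
#include <iostream>

// Heuristic, not authoritative: PREEMPT_RT-patched kernels expose
// /sys/kernel/realtime (containing "1"); a stock kernel normally has no such
// file, so a missing file is treated as "not an RT kernel".
bool kernel_is_preempt_rt() {
    std::ifstream f("/sys/kernel/realtime");
    int flag = 0;
    return f.good() && (f >> flag) && flag == 1;
}

int main(){
    std::cout << (kernel_is_preempt_rt() ? "RT-patched kernel" : "stock kernel")
              << std::endl;
}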

For obvious reasons, I find the final solution unsatisfactory. Having to run the program with rt priority and tell the scheduler to treat it specially feels hacky, but what’s worse is the busy-wait. If the tick interval were less than 2 milliseconds you’d have to busy-wait the whole time, which would freeze the entire system - clearly not appropriate for most normal programs. And to get “truly” real time behaviour you’d have to patch the kernel.

What I learned from this is that soft real-time depends not only on the language not having a garbage collector, but also on the language’s scheduler, and finally on the OS scheduler as well. I had previously assumed that Linux was suitably “soft real time” out of the box, but it appears I was mistaken - pauses in even very simple userspace programs under the default scheduler and priority can reach the millisecond range within very short runs (a few seconds or minutes). They are not unnoticeable, as I had assumed. Although you can’t get true realtime without patching the kernel, sched_setscheduler can bring the maximum pause down a lot in practice, so this hack may be worthwhile if you do not need hard real time but find the default Linux behaviour unsatisfactory.

UPDATE: I learned recently that it may be possible to achieve soft real time by pinning your process to a single CPU core. This is apparently what some HFT shops do. I will need to investigate this further.
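
For reference, pinning a process to one core on Linux is done with sched_setaffinity (or taskset from the shell). Pinning alone is probably not enough - presumably you also have to keep other work off that core, e.g. with the isolcpus kernel boot parameter - but a minimal pinning sketch (my own, untested for this use case) would look something like this:

// CPU_ZERO/CPU_SET need _GNU_SOURCE; g++ defines it by default, but be explicit.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <sched.h>
#include <cstdio>

int main(){
    // Pin the calling process to CPU core 2 (an arbitrary choice for this
    // example). For real isolation you would also keep other tasks off that
    // core, e.g. via the isolcpus= boot parameter.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0){
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the 16ms tick loop from above on this core ...
    return 0;
}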

Reference: https://rt.wiki.kernel.org/index.php/Frequently_Asked_Questions