Locking in crypto/rand
Jan 10, 2016 · 8 minute read · Go, Performance
As a follow-up to my previous post detailing my journey through some profiling of the math/rand package, I wanted to write about the crypto/rand package. A couple of people have suggested that I take a look at it instead of worrying about locking in math/rand. On the surface, it's an easy interface that fills a byte slice with cryptographically secure random data. I modified the rand_default.go program from the previous post to create a new program that pulls its data from crypto/rand instead of math/rand. The full text of the program is below:
package main

import (
	"crypto/rand"
	"fmt"
	"os"
	"runtime/pprof"
	"sync"
	"time"
)

// randData returns n bytes of cryptographically secure random data,
// mapped into the range 'A'..'Z'.
func randData(n int) []byte {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err.Error())
	}
	for i := range b {
		b[i] = byte('A') + (b[i] % 26)
	}
	return b
}

// toInt packs four bytes into an int, little-endian.
func toInt(b [4]byte) int {
	i := int(b[0])
	i |= int(b[1]) << 8
	i |= int(b[2]) << 16
	i |= int(b[3]) << 24
	return i
}

const numRuns = 10

func main() {
	f, err := os.Create("crypto_rand_default.prof")
	if err != nil {
		panic(err.Error())
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

	start := time.Now()
	for i := 0; i < numRuns; i++ {
		do()
	}
	total := time.Since(start)
	perRun := float64(total) / numRuns
	fmt.Printf("Time per run: %fns\n", perRun)
}

const numRoutines = 10

func do() {
	start := make(chan struct{})
	comm := make(chan []byte)

	var read, write sync.WaitGroup
	read.Add(numRoutines)
	write.Add(numRoutines)

	for i := 0; i < numRoutines; i++ {
		// Writer: picks a random length and sends that much random data.
		go func() {
			var r [4]byte
			<-start
			for j := 1; j < 10000; j++ {
				_, err := rand.Read(r[:])
				if err != nil {
					panic(err.Error())
				}
				comm <- randData(toInt(r) % 10000)
			}
			write.Done()
		}()
		// Reader: drains the channel and totals the bytes received.
		go func() {
			var sum int
			<-start
			for c := range comm {
				sum += len(c)
			}
			fmt.Println(sum)
			read.Done()
		}()
	}

	close(start)
	write.Wait()
	close(comm)
	read.Wait()
}
The difference between this program and the original is the use of crypto/rand in the randData function, and the addition of a toInt() function that made my life a bit easier when turning random byte slices into integers. Running the program gave an average time per run of 25.72 seconds, which is almost 2x the original program's 13.17 seconds.
$ go build crypto_rand_default.go
$ ./crypto_rand_default
(output elided)
Time per run: 25717479513.700001ns
Below is the graph generated with the go tool pprof command. Profiling stops the program 100 times a second to sample call stacks, determining the timing and the call graph of the program empirically rather than through source analysis. It also reports how much time was spent in a particular function itself and cumulatively in the nodes reachable from it in the call graph.
Now this is very interesting. The vast, vast majority of the time (97.88%, in fact) is spent making syscalls. If we take a look at the Go source code we can see that a call to crypto/rand.Read() will (on Linux) use the getrandom(2) syscall to read random data. Nowhere in the Go code does there seem to be anything that should stop us from executing as fast as we can, other than the overhead of performing a syscall, which involves some context switching and some buffer copying.
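To put a rough number on that per-call overhead, a micro-benchmark along these lines (my own sketch, not one of the programs from this post; drop it in a file ending in _test.go and run go test -bench=.) times individual small reads:

package randbench

import (
	"crypto/rand"
	"testing"
)

// BenchmarkSmallRead times filling a tiny buffer from crypto/rand; on Linux
// each call ends up in the getrandom(2) syscall (or a read of /dev/urandom
// on kernels that don't have it).
func BenchmarkSmallRead(b *testing.B) {
	buf := make([]byte, 4)
	for i := 0; i < b.N; i++ {
		if _, err := rand.Read(buf); err != nil {
			b.Fatal(err)
		}
	}
}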
A Bit of Benchmarking
A reader on the previous post (h/t to kune18) wanted to compare the speed of the different methods of generating random numbers in raw data per second. He wrote up a playground snippet to demonstrate a reader wrapped around the math/rand package. His benchmarks showed 431 MB/s from math/rand and 21 MB/s from crypto/rand. Unfortunately, the playground post wasn't complete enough to reproduce, so I created my own program to do just that. I got similar results, with 416 MB/s and 13 MB/s respectively. This means a single unfettered math/rand.Rand instance is 31.28x faster than using crypto/rand.Read().
$ go build rand_speed.go && ./rand_speed
4361945120 bytes read in 10 seconds from math/rand
415.9875030517578 MB/s
139436064 bytes read in 10 seconds from crypto/rand
13.297659301757813 MB/s
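That rand_speed.go program isn't reproduced in this post; a minimal sketch of the same kind of throughput comparison (my own code and names, not the original program, and it relies on the Read methods that math/rand has had since Go 1.6) might look like this:

package main

import (
	crand "crypto/rand"
	"fmt"
	mrand "math/rand"
	"time"
)

// measure reports how many bytes read() can produce in the given duration.
func measure(read func([]byte) (int, error), d time.Duration) int64 {
	buf := make([]byte, 32*1024)
	var total int64
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		n, err := read(buf)
		if err != nil {
			panic(err)
		}
		total += int64(n)
	}
	return total
}

func main() {
	const d = 10 * time.Second
	// A single, unshared math/rand.Rand, so there is no lock contention at all.
	mr := mrand.New(mrand.NewSource(1))
	sources := []struct {
		name string
		read func([]byte) (int, error)
	}{
		{"math/rand", mr.Read},
		{"crypto/rand", crand.Read},
	}
	for _, src := range sources {
		n := measure(src.read, d)
		fmt.Printf("%d bytes read in %d seconds from %s\n", n, int(d.Seconds()), src.name)
		fmt.Printf("%f MB/s\n", float64(n)/d.Seconds()/(1<<20))
	}
}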
One Level Deeper
I wanted to understand what was happening at a deeper level than just "the syscalls are taking a long time." I tried to use strace to find the calls to getrandom(), but unfortunately the strace version I had my hands on was not aware of it at all and would show me a syscall_318 instead. Even that was superficial, however, since the tool only showed me what I already knew.
perf_events is a tool much like what the runtime/pprof package provides. It allows a user to sample a program's stacks at a specified frequency to determine where it is spending most of its time. It can do a whole lot more than that, but this was the use I needed. For anyone interested in learning more about perf_events, Brendan Gregg has a very good introduction to the tool.
In order to use perf_events, you must have it installed on your system.
$ sudo apt-get install linux-tools-common linux-tools-generic linux-tools-`uname -r`
To prevent any possible conflict, I removed the code in the test program that performed CPU profiling from within Go. Below is the run used to collect data for analysis of the final program:
$ go build crypto_rand_default.go
$ sudo perf record -F 99 ./crypto_rand_default # sudo needed here to access kernel symbols
(output elided)
Time per run: 25257292731.000000ns
[ perf record: Woken up 8 times to write data ]
[ perf record: Captured and wrote 1.707 MB perf.data (44503 samples) ]
The perf report command brings up a screen in the terminal that shows the time spent in each function it could attribute, both in the program and in the kernel, ordered by time spent in the function.
$ sudo perf report
Samples: 44K of event 'cpu-clock', Event count (approx.): 452545450020
Overhead Command Shared Object Symbol
50.41% crypto_rand_def [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
39.55% crypto_rand_def [kernel.kallsyms] [k] extract_buf
3.70% crypto_rand_def [kernel.kallsyms] [k] __memset
2.19% crypto_rand_def [kernel.kallsyms] [k] copy_user_generic_string
1.48% crypto_rand_def crypto_rand_default [.] main.randData
1.01% crypto_rand_def [kernel.kallsyms] [k] memzero_explicit
0.39% crypto_rand_def [kernel.kallsyms] [k] finish_task_switch
0.28% crypto_rand_def [kernel.kallsyms] [k] extract_entropy_user
0.14% crypto_rand_def [kernel.kallsyms] [k] sha_init
0.12% crypto_rand_def [kernel.kallsyms] [k] _copy_to_user
0.12% crypto_rand_def [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.06% crypto_rand_def [kernel.kallsyms] [k] __do_softirq
0.06% crypto_rand_def [kernel.kallsyms] [k] retint_careful
0.04% crypto_rand_def crypto_rand_default [.] runtime.memclr
0.03% crypto_rand_def crypto_rand_default [.] runtime.mallocgc
0.03% crypto_rand_def crypto_rand_default [.] runtime.cas
0.02% crypto_rand_def crypto_rand_default [.] runtime.xchg
0.02% crypto_rand_def crypto_rand_default [.] runtime.mSpan_Sweep
0.02% crypto_rand_def crypto_rand_default [.] syscall.Syscall
0.01% crypto_rand_def crypto_rand_default [.] main.do.func1
...
Now we’re getting somewhere. It seems most of the time is spent in the _raw_spin_unlock_irqrestore and extract_buf functions in the kernel. What do those do? Well, extract_buf turns out to be the function that does the heavy lifting in the code that backs /dev/random, /dev/urandom, and the getrandom(2) syscall.
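For reference, that same kernel path can be exercised directly from Go through the golang.org/x/sys/unix wrapper; this snippet is only an illustration, not part of the post's programs:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	buf := make([]byte, 16)
	// unix.Getrandom wraps the getrandom(2) syscall; with no flags it draws
	// from the same pool that backs /dev/urandom.
	n, err := unix.Getrandom(buf, 0)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d random bytes: %x\n", n, buf[:n])
}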
Around the single cryptographically-secure random number generator (CSRNG) core in the kernel is a lock to prevent parallel requests from modifying state at the same time. As well, it disables interrupts while work is done. The lock itself is a spinlock, which is a special kind of lock used in special kinds of situations. The biggest time sink was the function _raw_spin_unlock_irqrestore. This is the architecture-dependent function that unlocks a spinlock. The spinlocks themselves are an interesting piece of the kernel, and are heavily architecture dependent, so I won't get into them here. Needless to say, the spinlock implementation is fun to read.
Oddly, the unlocking phase takes the most time, but my guess here is that the interrupts that were stored up while the RNG code was doing work all got serviced immediately after the unlock re-enabled interrupts, and thus the unlock appears to take a long time but is actually just being interrupted repeatedly.
Another Lock Revealed
Again, the details below the surface thwart the programmer who wishes to obtain random data. It makes sense, as before, that there is a lock around the single RNG that is producing numbers. Even with the high-performance implementation of spinlocks, there is no getting away from the serialization of requests for data. It's also helpful to be mindful of the implications of a syscall: even when the request itself is small, the cost of switching into the kernel can be large.
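One common way to blunt that per-call cost (my own suggestion, not something from the original post) is to amortize the syscalls by putting a buffered reader in front of crypto/rand.Reader. The trade-offs are that random bytes sit in user-space memory for a while and that a shared buffer needs its own locking:

package main

import (
	"bufio"
	"crypto/rand"
	"fmt"
	"io"
)

func main() {
	// Each refill of the bufio buffer is one large read from the kernel;
	// small reads are then served from the buffer with no syscall at all.
	buffered := bufio.NewReaderSize(rand.Reader, 4096)

	nonce := make([]byte, 12)
	if _, err := io.ReadFull(buffered, nonce); err != nil {
		panic(err)
	}
	fmt.Printf("nonce: %x\n", nonce)
}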
I hope that these two articles have demystified the math/rand and crypto/rand packages for you. If you have any questions, please leave a comment. I'd be happy to hear if it helped.
Bonus Section: Which Should You Use?
The first version of this article was missing an important point: why Go has two different random number sources in the first place. The key difference is in the suitability of the output of each.
math/rand is good if you need some random numbers to be used for testing or to ensure that a part of the program behaves differently each time. For example, the popular go-fuzz package uses math/rand to generate randomness for fuzz testing. Most use cases can be satisfied by using math/rand.
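As an illustration (my own, not from the original post), a seeded generator keeps a test run reproducible while still exercising varied inputs; swap the fixed seed for time.Now().UnixNano() when you want different behavior on every run:

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// A fixed seed gives the same pseudo-random sequence on every run.
	r := rand.New(rand.NewSource(42))
	for i := 0; i < 3; i++ {
		fmt.Println(r.Intn(100)) // pseudo-random ints in [0, 100)
	}
}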
crypto/rand is a different beast entirely. The data returned is guaranteed to be cryptographically secure, and is usable for security purposes such as generating nonces or encryption keys. The slower call is simply the cost of securely generated numbers, and crypto/rand should not be bypassed for speed reasons. Please don't ever use math/rand for something that needs to be secure.
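The usual pattern for that kind of use is a direct, fully checked read from crypto/rand; the 32-byte key size below is just an example:

package main

import (
	"crypto/rand"
	"fmt"
	"io"
)

func main() {
	key := make([]byte, 32) // e.g. a 256-bit symmetric key
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err) // never ignore a failed read of key material
	}
	fmt.Printf("key: %x\n", key)
}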
Thanks to Tim Kagle in the Gophers slack and dchapes over on reddit for calling me out for omitting this discussion.