The Hidden Dangers of Default Rand
Jan 5, 2016 · 8 minute read · Go, Performance
This post is based on a real-world problem that I faced when I was developing a load generator
program for work. I looked through my code carefully to make sure that there weren’t any bottlenecks
in the design or bugs in the implementation, but didn’t find anything explicitly wrong. I did some
profiling after that and found something very interesting: deep down in the rand
package was a
lock that was taking up the majority of the time in the application. The following three programs
show a distilled version of the start, middle, and end states that my program went through while
working with and around the rand
package.
Program Design
The program, in all three cases, has the same design. There is a pool of producer goroutines generating random strings that are then passed through a single channel. On the other end of the channel is a pool of consumer goroutines that consume the random strings. In the real world application there was a network call using the strings, but in these test programs the string length is used to increment a counter. CPU profiling is on for each program the whole time, which slows it down. Since we’re looking only at relative performance, this doesn’t matter.
Getting the SVGs
First, a quick note about the SVGs below: I find it easiest to analyze the time spent in the code
using the graph generated by the web command in the pprof tool. Below is an example session
in the pprof tool. You can read more about pprof at the golang blog.
$ go tool pprof rand_default rand_default.prof
Entering interactive mode (type "help" for commands)
(pprof) web
Unfortunately, the SVGs don’t work well in <img> tags out of the box, so I had to open them up in
a text editor and fiddle with the width and height properties to make them display properly in a
browser. (In case you’re wondering, I made them fixed and guessed at values until the graph showed
in its entirety.) The text might be a bit small, but I’ve included links to them below each one, and
they come with built-in pan-and-zoom functionality so you can see them better.
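(One more tip: if you’d rather skip the browser round-trip, pprof can also write the SVG straight to a file. This assumes Graphviz is installed, since both the web command and the -svg flag shell out to dot.)
$ go tool pprof -svg rand_default rand_default.prof > rand_default.svg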
Initial code
Below is the entirety of the example program I’ve written to
demonstrate the issues with just using the default rand
methods.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime/pprof"
	"sync"
	"time"
)

var letters = []byte("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

func randData(n int) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return b
}
const numRuns = 100
func main() {
	start := time.Now()
	for i := 0; i < numRuns; i++ {
		do()
	}
	total := time.Since(start)
	perRun := float64(total) / numRuns
	fmt.Printf("Time per run: %fns\n", perRun)

	// Now generate the pprof data for a single run
	f, err := os.Create("rand_default.prof")
	if err != nil {
		panic(err.Error())
	}
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()
	do()
}
const numRoutines = 10
func do() {
	start := make(chan struct{})
	comm := make(chan []byte)

	var read, write sync.WaitGroup
	read.Add(numRoutines)
	write.Add(numRoutines)

	for i := 0; i < numRoutines; i++ {
		go func() {
			<-start
			for j := 1; j < 10000; j++ {
				comm <- randData(rand.Intn(j))
			}
			write.Done()
		}()

		go func() {
			var sum int
			<-start
			for c := range comm {
				sum += len(c)
			}
			fmt.Println(sum)
			read.Done()
		}()
	}

	close(start)
	write.Wait()
	close(comm)
	read.Wait()
}
Running the program on the console gives 13.1715 seconds per iteration:
$ go build rand_default.go
$ ./rand_default
(output elided)
Time per run: 13171548294.990000ns
Below is the graph generated by the pprof web command, showing the timing and connectedness of the
different pieces of the program.
Ouch, those timings hurt. The total time spent in math/rand is 2066.06 seconds, with a whopping
1067.64 seconds spent obtaining the underlying lock and 530.70 seconds spent unlocking that
same lock. Only 99.75 seconds were spent actually doing the work we wanted. That means
1598.34 / 2066.06 = 77.36% of the time inside math/rand, and 72.85% of the overall run time,
went to synchronization overhead.
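If you prefer staying in the terminal, the same hot spots show up without the graph; the top command in an interactive pprof session lists the heaviest functions (I’ve omitted the output here since it varies run to run):
$ go tool pprof rand_default rand_default.prof
(pprof) top5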
Checking the math/rand
Source Code
Now that we can see the problem in the graph, we should take a look at a bit of the rand
package’s
source code to find the lock.
var globalRand = New(&lockedSource{src: NewSource(1)})
type lockedSource struct {
	lk  sync.Mutex // <------------------ Found it!
	src Source
}

func (r *lockedSource) Int63() (n int64) {
	r.lk.Lock() // <----------------------------------
	n = r.src.Int63()
	r.lk.Unlock() // <----------------------------------
	return
}

func (r *lockedSource) Seed(seed int64) {
	r.lk.Lock() // <----------------------------------
	r.src.Seed(seed)
	r.lk.Unlock() // <----------------------------------
}
The math/rand package makes it easy to get started with random numbers. The globalRand variable
backs the exposed package-level functions, which all pass through to it. Behind globalRand (and
thus the package-level rand functions) is a locked pseudo-random number generator (RNG). This makes
sense: the generator produces one number at a time, and each number becomes the basis for the next,
so its internal state can’t be safely accessed concurrently, and a lock is needed to serialize
access. I should point out that this is not a design flaw; it’s the package doing the right thing
by default.
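To see the contention in isolation, here is a minimal benchmark sketch (mine, not from the original program; the file and benchmark names are made up) that pits the locked package-level functions against per-goroutine generators:
package randbench

import (
	"math/rand"
	"testing"
	"time"
)

// BenchmarkGlobal hammers the package-level Intn, funneling every call
// through the lockedSource mutex shown above.
func BenchmarkGlobal(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rand.Intn(26)
		}
	})
}

// BenchmarkPerGoroutine gives each worker its own *rand.Rand, so no
// locking is involved at all.
func BenchmarkPerGoroutine(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		r := rand.New(rand.NewSource(time.Now().UnixNano()))
		for pb.Next() {
			r.Intn(26)
		}
	})
}
Running it with go test -bench . should show the per-goroutine version pulling further ahead as GOMAXPROCS grows, while the global version stays pinned behind the mutex.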
An Easy Optimization
Fortunately for me, the kind people in the Gophers Slack team directed me towards the rand.New()
function, which takes a rand.Source as an argument. A Source can be obtained from rand.NewSource(),
which takes an int64 seed and returns the same kind of generator that backs the package-level
functions, but without the locking wrapper. Applying this new-found knowledge takes only a few
changed lines of code. I’ll show just the updated bits here.
The randData function now takes a *rand.Rand to use when generating random strings:
func randData(r *rand.Rand, n int) []byte {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[r.Intn(len(letters))]
	}
	return b
}
And each producer goroutine now creates its own instance of rand.Rand
when it starts so it can be
passed to the randData function each time:
go func() {
	// Seed with UnixNano; with Unix(), every producer started in the
	// same second would get the same seed and generate identical data.
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	<-start
	for j := 1; j < 10000; j++ {
		comm <- randData(r, r.Intn(j))
	}
	write.Done()
}()
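One thing to keep in mind with this pattern: a *rand.Rand built from rand.NewSource() has no locking at all, so it must never be shared between goroutines. That’s exactly why each producer creates its own instance instead of the pool sharing one.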
Running the optimized program gives us a time per iteration of 2.2668 seconds:
$ go build rand_optimized.go
$ ./rand_optimized
(output elided)
Time per run: 2266835998.330000ns
That’s quite an improvement: the new version runs in 17.21% of the original time, a 5.8x speedup. Recall that synchronization accounted for 72.85% of the original run time, so the rest of the program took only 27.15% of it. If eliminating the locks were the only gain, we’d expect the time to drop to roughly 27.15% of the original, yet it dropped all the way to 17.21%. Using separate instances bought us more than just removing the synchronization overhead; my educated guess is that true parallelism allowed the program to run multiple RNGs at the same time on different cores.
From the pprof
graph of the optimized version, we can see that the majority of the time is still
spent generating random numbers. This isn’t unexpected, but that’s still a lot of time spent
overall. The RNGs still take up 335.14 / 429.49 = 78.03% of the program’s run time. What if we could
eliminate most of that as well?
A Step Further
Let’s take things a step further. What if, instead of entirely random strings, I used a random length to slice a premade string? This relaxation lets me generate random data once and just take the subset needed each time, which saves allocations, GC pressure, and a lot of computation. The changes to do this are small as well.
I added an init function and a package variable:
var rData []byte
func init() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	rData = randData(r, 10000)
}
And in the producer goroutine, I changed the call to randData
to a slice operation on the
pre-generated random data:
go func() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	<-start
	for j := 1; j < 10000; j++ {
		comm <- rData[:r.Intn(j)]
	}
	write.Done()
}()
Running this extra-optimized code gives an average time per iteration of 0.0252 seconds:
$ go build rand_extraopt.go
$ ./rand_extraopt
(output elided)
Time per run: 25190140.240000ns
Now the single RNG takes 0.32 / 2.81 = 11.39% of the run time, but that cost is paid only once at startup, so its share simply shrinks as more iterations run in the rest of the code. Compared to the initial version, this code takes 0.1912% of the time, a 522.88x improvement over the original. Most of the remaining time is spent communicating over the channel that pipes data from producers to consumers. In the real-world application, this meant the time was spent hammering the target, not overheating the server it was running on.
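One caveat worth noting: every slice sent this way shares the same backing array, so this trick only works because the consumers never mutate the data. And if a bit more variety is needed, a random start offset is nearly free; here is a sketch of that variant (my own tweak, not part of the measured program):
go func() {
	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	<-start
	for j := 1; j < 10000; j++ {
		n := r.Intn(j)
		off := r.Intn(len(rData) - n + 1) // random window start
		comm <- rData[off : off+n]        // still no per-message allocation
	}
	write.Done()
}()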
Profiling Reveals Many Surprises
The profiling tools in Go were an enigma to me before this process of debugging a slow program.
Along the way I found a huge time sink in the math/rand locking, but I found some other surprises
as well. I’d encourage anyone to profile their Go programs (once they’re functioning properly) to
see if they can’t find something that could be easily optimized to save them some time. The rand
package is just the example I now use to explain the value of doing so. If profiling has ever
helped you out, let me know in the comments. I’d be very interested to hear some more optimization
war stories.