Measuring the cost of spawning goroutines
Developers who learn or start with Golang are taught to treat goroutines as a very cheap alternative to threads. The minimum stack size of a goroutine has been shrinking across releases and stands at 2 kB as of Go 1.19, which makes it easy and efficient to spawn goroutines for small tasks. Much of the standard library, net/http included, spawns a goroutine for every incoming request and returns the response from that same goroutine, and this model scales well compared to most interpreted languages.
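For instance, a plain net/http server already works this way out of the box: the standard library serves each incoming connection on its own goroutine, so a handler like the hypothetical one below runs concurrently with no extra setup.

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// net/http serves every connection on its own goroutine,
	// so this handler can run many times in parallel.
	http.HandleFunc("/sum", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}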
This article investigates whether spawning a goroutine for every request or task is really the best way to build concurrent applications.
We will create a simple program that calculates the sum of two numbers at different request volumes.
In the first test, we will spawn a goroutine for every request to calculate the sum.
In the second test, we will spawn a fixed set of goroutines at process start-up, and as requests come in, we will feed them to this pool through a single shared channel that all the goroutines listen on.
In the third test, we will run a similar setup to the second test, except that each goroutine gets its own input channel and requests are assigned to those channels round-robin.
In the tests below, we measure the overall time taken for the logic to complete.
The tests were run on an Apple M1 Max Pro with 12 cores and 32 GB of memory.
package main

import (
	"flag"
	"log"
	"strconv"
	"time"
)

// OutputChSize buffers results so the workers rarely block on send.
// The exact value is assumed here; the original snippet did not show it.
const OutputChSize = 10000

func calculateSum(outputCh chan int, a, b int) {
	outputCh <- a + b
}

func main() {
	outputCh := make(chan int, OutputChSize)

	flag.Parse()
	frequency, _ := strconv.Atoi(flag.Args()[0])
	log.Println("Frequency selected", frequency)

	startTime := time.Now()

	// Test 1: spawn one goroutine per request.
	go func() {
		for currIdx := 0; currIdx < frequency; currIdx++ {
			go calculateSum(outputCh, 4, currIdx)
		}
	}()

	// Drain one result per request, then stop the clock.
	returnedCount := 0
	for returnedCount < frequency {
		<-outputCh
		returnedCount++
	}

	timeTaken := time.Since(startTime)
	log.Println("Total Queries processed:", returnedCount, "in", timeTaken.Milliseconds(), "ms")
}
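For reference, the pooled variant used in the second test looks roughly like the sketch below. This is a sketch under assumptions: the request type, the channel buffer sizes, and the hard-coded frequency are illustrative choices, not the exact benchmark code.

package main

import (
	"log"
	"time"
)

// request carries the two operands for one sum.
type request struct{ a, b int }

// worker consumes requests from inputCh until it is closed.
func worker(inputCh <-chan request, outputCh chan<- int) {
	for req := range inputCh {
		outputCh <- req.a + req.b
	}
}

func main() {
	const (
		frequency   = 1000000 // assumed here; the benchmark reads it from the command line
		concurrency = 10      // pool size used in the second test
		chanBuffer  = 10000   // buffer size is an assumption
	)
	inputCh := make(chan request, chanBuffer)
	outputCh := make(chan int, chanBuffer)

	// Spawn the pool once, at startup.
	for i := 0; i < concurrency; i++ {
		go worker(inputCh, outputCh)
	}

	startTime := time.Now()
	go func() {
		// Each request is now a channel send, not a goroutine spawn.
		for currIdx := 0; currIdx < frequency; currIdx++ {
			inputCh <- request{a: 4, b: currIdx}
		}
		close(inputCh)
	}()

	for returnedCount := 0; returnedCount < frequency; returnedCount++ {
		<-outputCh
	}
	log.Println("done in", time.Since(startTime).Milliseconds(), "ms")
}

Note that with a single shared channel there is no explicit round-robin step: whichever worker is ready receives the next request, which in practice spreads the work evenly.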
Comparing the test results
Test results for spawning goroutine on every request
- 1000000000 (One Billion) requests — 435440 ms (7.25 minutes)
- 100000000 (One Hundred Million) requests — 43319 ms
- 10000000 (Ten Million) requests — 4232 ms
- 1000000 (One Million) requests — 444 ms
Test results for spawning goroutine on process start with one input channel for concurrency 10
- 1000000000 (One Billion) requests — 168123 ms (2.8 minutes)
- 100000000 (One Hundred Million) requests — 16644 ms
- 10000000 (Ten Million) requests — 1738 ms
- 1000000 (One Million) requests — 167 ms
Based on the comparison above, starting a goroutine pool at process start-up and then assigning requests to it is roughly 2.6x faster (435440 ms vs 168123 ms for one billion requests) than creating a goroutine on the fly.
I want to make it faster. Let’s add more goroutines
A common assumption is that adding more concurrency or parallelism to a process makes it faster. Let’s put that to the test in this section.
In the above example, we tried the benchmark with a concurrency of 10 goroutines. We will try the benchmark again with a concurrency of 100 and 1000.
The results may surprise you :)
Test results for spawning goroutine on process start with one input channel for concurrency 100
- 1000000000 (One Billion) requests — 253336 ms (4.2 minutes)
- 100000000 (One Hundred Million) requests — 25689 ms
- 10000000 (Ten Million) requests — 2545 ms
- 1000000 (One Million) requests — 273 ms
Test results for spawning goroutine on process start with one input channel for concurrency 1000
- 1000000000 (One Billion) requests — 329467 ms (5.5 minutes)
- 100000000 (One Hundred Million) requests — 32022 ms
- 10000000 (Ten Million) requests — 3165 ms
- 1000000 (One Million) requests — 386 ms
The results show that as we increase the concurrency from 10 to 1000, total throughput actually drops.
Why is that? Shouldn’t adding more goroutines add more throughput?
The answer is that the system has a finite set of resources. If a process with 10 goroutines can already saturate the CPU, then chances are that adding more concurrency won’t increase the overall throughput of the process.
Having more goroutines than the machine can run in parallel creates additional management overhead for the Golang scheduler. Communication objects like channels rely on synchronisation primitives such as futexes (fast userspace mutexes) to ensure no race conditions occur when multiple goroutines write to them.
Another thing to note is that the number-adding program is CPU-bound rather than IO-bound. The numbers in this section may look completely different for an IO-bound application such as an HTTP server.
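One practical takeaway: for a CPU-bound pool, size the pool to the number of CPUs the Go runtime can actually schedule on instead of picking a large number. A sketch, swapping this into the pool-spawning loop of the earlier worker-pool sketch (requires importing "runtime"):

// Size the pool to the schedulable CPUs. GOMAXPROCS(0) reads the
// current setting without changing it.
concurrency := runtime.GOMAXPROCS(0)
for i := 0; i < concurrency; i++ {
	go worker(inputCh, outputCh)
}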
What other parameters can we change for the test? In the above approach of spawning goroutines at process start, all the goroutines are listening on a single channel. Let’s see if having one channel per goroutine helps; a sketch of that variant follows.
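Only the dispatch side changes in this variant. Reusing the worker and request definitions from the earlier pool sketch, the per-goroutine channels and the round-robin assignment look roughly like this:

// One input channel per worker instead of a single shared channel.
inputChs := make([]chan request, concurrency)
for i := range inputChs {
	inputChs[i] = make(chan request, chanBuffer)
	go worker(inputChs[i], outputCh)
}

go func() {
	for currIdx := 0; currIdx < frequency; currIdx++ {
		// Round-robin: request i goes to worker i % concurrency.
		inputChs[currIdx%concurrency] <- request{a: 4, b: currIdx}
	}
	for _, ch := range inputChs {
		close(ch)
	}
}()

Per-worker channels avoid contention on a single channel’s lock, at the cost of possible imbalance if some requests take longer than others.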
Test results for spawning goroutine on process start with multiple input channels for concurrency 100
- 1000000000 (One Billion) requests — NA
- 100000000 (One Hundred Million) requests — 23016 ms
- 10000000 (Ten Million) requests — 2138 ms
- 1000000 (One Million) requests — 228 ms
Test results for spawning goroutine on process start with multiple input channels for concurrency 1000
- 1000000000 (One Billion) requests — NA
- 100000000 (One Hundred Million) requests — 27504 ms
- 10000000 (Ten Million) requests — 2920 ms
- 1000000 (One Million) requests — 310 ms
As the results show, giving each goroutine its own channel improves throughput over the shared-channel tests at the same concurrency (for example, 27504 ms vs 32022 ms for one hundred million requests at concurrency 1000), but not by much.
Likewise, dropping the concurrency from 1000 back to 100 within this variant only trims the times modestly.
Conclusion
Prefer pre-allocating resources for computation, regardless of how cheap the concurrency primitives are. Pre-allocation does carry its own goroutine-management overhead, and a badly sized pool can run into resource starvation; but that usually only happens when resource estimation hasn’t been done properly, and poor estimation hurts on-the-fly allocation in its own ways too.
I hope you liked the article. Please let me know if you have any queries. Happy reading!!