Profiling in Go: A Practical Guide
By Noam YadgarProfiling is an underrated skill among software-engineers and it’s often overlooked by even very skilled developers. Profiling a program, is essentially measuring CPU and memory usage to spot bottlenecks, memory leaks, and other performance issues. Knowing how profile a program and interpret the result can direct us to where exactly our program is suffering from poor performance, and focus our efforts on optimizing those specific parts. In this article, we’ll be profiling a real-world Go program. We’ll learn how to interpret the results, draw conclusions, and optimize the program accordingly.
The program we’ll be profiling
What’s the hottest place on Earth? What is the coldest? Our program interacts with the Global Temperature dataset, which contains a hundred years of monthly average temperatures of about 85K coordinates around the globe. So in terms of records: \(85,000 \times 12 \times 100 = 102,000,000\) records (about 1.1Gb). Our program reads this dataset and outputs a table of all coordinates with their total average temperature over the years.
Let’s look at the data first:
ls
air_temp.1900 air_temp.1911 air_temp.1922 air_temp.1933 air_temp.1944 air_temp.1955 air_temp.1966 air_temp.1977
air_temp.1901 air_temp.1912 air_temp.1923 air_temp.1934 air_temp.1945 air_temp.1956 air_temp.1967 air_temp.1978
air_temp.1902 air_temp.1913 air_temp.1924 air_temp.1935 air_temp.1946 air_temp.1957 air_temp.1968 air_temp.1979
air_temp.1903 air_temp.1914 air_temp.1925 air_temp.1936 air_temp.1947 air_temp.1958 air_temp.1969 air_temp.1980 ...
Each file contains the monthly average temperatures of the year per coordinates.
The content of each file is as follows (I will take the first 5 lines of one year):
head -n 5 air_temp.1990
-179.750 71.250 -24.7 -29.9 -24.4 -12.5 -2.6 0.4 2.2 4.2 -0.9 -5.9 -15.4 -25.5
-179.750 68.750 -29.9 -32.5 -22.5 -10.9 -3.7 0.7 2.8 3.6 -1.1 -8.7 -17.1 -27.5
-179.750 68.250 -31.0 -33.7 -23.6 -11.9 -4.5 0.9 3.3 3.6 -1.7 -9.7 -18.3 -28.2
-179.750 67.750 -30.1 -33.9 -23.3 -11.5 -3.2 3.6 6.8 6.0 -0.2 -9.0 -18.6 -27.0
-179.750 67.250 -32.0 -37.9 -26.8 -14.5 -5.2 3.6 7.2 5.7 -1.9 -11.7 -22.0 -29.4
The first two columns are the coordinates, and the rest are the monthly average temperatures.
Here is our full program:
1package main
2
3import (
4 "encoding/csv"
5 "flag"
6 "fmt"
7 "io"
8 "os"
9 "strconv"
10 "strings"
11)
12
13type record struct {
14 lon, lat float64
15 temps [12]float64
16}
17
18func main() {
19 files, _ := os.ReadDir("data")
20
21 rawData := make(map[string][]byte)
22 for _, file := range files {
23 f, _ := os.Open("data/" + file.Name())
24 rawData[file.Name()], _ = io.ReadAll(f)
25 f.Close()
26 }
27
28 parsed := parseData(rawData)
29 output := make(map[string][]float64)
30 for _, v := range parsed {
31 for _, r := range v {
32 key := fmt.Sprintf("%.3f:%.3f", r.lat, r.lon)
33 output[key] = append(output[key], r.temps[:]...)
34 }
35 }
36
37 outputFile, _ := os.Create("output.csv")
38 w := csv.NewWriter(outputFile)
39 defer w.Flush()
40 defer outputFile.Close()
41
42 for coordinates, yearlyTemps := range output {
43 row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
44 w.Write(row)
45 }
46}
47
48func parseData(input map[string][]byte) map[string][]record {
49 m := make(map[string][]record)
50 for filename, v := range input {
51 lines := strings.Split(string(v), "\n")
52 for _, line := range lines {
53 seg := strings.Fields(line)
54 if len(seg) != 14 {
55 continue
56 }
57 lon, _ := strconv.ParseFloat(seg[0], 64)
58 lat, _ := strconv.ParseFloat(seg[1], 64)
59 temps := [12]float64{}
60 for i := 2; i < 14; i++ {
61 t, _ := strconv.ParseFloat(seg[i], 64)
62 temps[i-2] = t
63 }
64
65 m[filename] = append(m[filename], record{lon, lat, temps})
66 }
67 }
68
69 return m
70}
71
72func average(input []float64) float64 {
73 var sum float64
74 for _, v := range input {
75 sum += v
76 }
77 return sum / float64(len(input))
78}
If we’ll run this program, it will take few seconds and output a file, named output.csv
with the results.
On my machine it took about 14 seconds. CPU: Intel i7-9700 (8) @ 4.700GHz
Here are the first few lines of the output file (sorted from highest to lowest):
column -t -s, output.csv | sort -k2 -nr | head -n 5
-9.750:-75.250 34.99
-10.250:-75.250 33.79
-10.250:-74.750 32.98
19.250:41.250 31.83
-7.250:31.750 31.51
Profiling with runtime/pprof
Go is equipped with an incredible profiling tool called runtime/pprof
. With this library, we can easily profile our program’s CPU and memory usage.
Profiling CPU usage
CPU profiling is a time-based measurement. We’d like to know, how long does it take for each function to execute. When we perform CPU profiling with runtime/pprof
, our program will be briefly paused at regular intervals (100/per second) to collect samples. The more samples we collect, the more accurate the result will be.
Since our program is a single short-living procedure with a defined end, we can profile the entire program. Here is how we can do it:
1import (
2 "flag"
3 "os"
4 "runtime/pprof"
5 //... other imports
6)
7
8var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")
9
10func main() {
11 flag.Parse()
12 if *cpuprofile != "" {
13 cpuf, _ := os.Create(*cpuprofile)
14 pprof.StartCPUProfile(cpuf)
15 defer cpuf.Close()
16 defer pprof.StopCPUProfile()
17 }
18 // ...
19}
When we run our program with the -cpuprofile
flag, it will start CPU profiling and write the results to the specified file.
Profiling memory usage
While CPU profiling is integrating the collection of samples over a time period, memory profiling is a snapshot of the memory state at a specific moment. If possible, we should find a strategic point in our program to take the snapshot. In our case, it looks like we can get a lot of information if we place it just before we write rows to the output file. Let’s add it as well:
1// ...
2var (
3 cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")
4 memprofile = flag.String("memprofile", "", "write memory profile to this file")
5)
6
7func main() {
8 flag.Parse()
9 if *cpuprofile != "" {
10 cpuf, _ := os.Create(*cpuprofile)
11 pprof.StartCPUProfile(cpuf)
12 defer cpuf.Close()
13 defer pprof.StopCPUProfile()
14 }
15
16 // ... some code
17
18 if *memprofile != "" {
19 memf, _ := os.Create(*memprofile)
20 pprof.WriteHeapProfile(memf)
21 memf.Close()
22 }
23
24 for coordinates, yearlyTemps := range aggregated {
25 row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
26 w.Write(row)
27 }
28}
Analyzing the CPU profile
We can now build our program and run it with the profiling flags
go build -o bin main.go
./bin -cpuprofile=cpu.prof
./bin -memprofile=mem.prof
The reason I run the program twice is that I want to profile CPU and memory separately. If we run the program with both flags, the CPU profile will be affected by the memory profiling and vice versa.
cpu.prof
and mem.prof
. Let’s analyze the results.
Running go tool pprof
We can launch the Go pprof
tool and analyze the results, let’s start with the CPU profile:
go tool pprof bin cpu.prof
File: bin
Type: cpu
Time: Apr 7, 2024 at 5:24pm (IDT)
Duration: 14.03s, Total samples = 14.05s (100.11%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)
We are now, at the pprof
CLI, and your first go to action will be to check topN
.
Exploring topN
Let’s see the top 5 most CPU expensive functions:
(pprof) top5
Showing nodes accounting for 7360ms, 52.38% of 14050ms total
Dropped 104 nodes (cum <= 70.25ms)
Showing top 5 nodes out of 72
flat flat% sum% cum cum%
2640ms 18.79% 18.79% 2640ms 18.79% runtime.memmove
1820ms 12.95% 31.74% 2560ms 18.22% strings.Fields
1600ms 11.39% 43.13% 1600ms 11.39% strconv.readFloat
800ms 5.69% 48.83% 800ms 5.69% strconv.(*decimal).Assign
500ms 3.56% 52.38% 500ms 3.56% strconv.atof64exact
- Flat - Is the time/percentage spent by the described function call
- Cum (cumulative) - Is the time/percentage spent by the functions being called by the described function (self included).
Looks like runtime.memmove
, strings.Fields
and strconv.readFloat
are pretty expensive.
To get a better understanding, we can sort top5
by cum
and see who’s calling those functions
(pprof) top5 -cum
Showing nodes accounting for 0.47s, 3.35% of 14.05s total
Dropped 104 nodes (cum <= 0.07s)
Showing top 5 nodes out of 72
flat flat% sum% cum cum%
0.22s 1.57% 1.57% 13.81s 98.29% main.main
0 0% 1.57% 13.81s 98.29% runtime.main
0.11s 0.78% 2.35% 7.51s 53.45% main.parseData
0.02s 0.14% 2.49% 3.28s 23.35% fmt.Sprintf
0.12s 0.85% 3.35% 3.17s 22.56% strconv.ParseFloat
Notice that runtime.main
is taking 98.29% of cumulative time but 0% flat. It makes sense, since everything is being called by runtime.main
, however, the function itself doesn’t take any time from the CPU. We can also see functions that we’re calling in our program: main.parseData
is our own function, and it takes 7.51 seconds from the entire process. Also, fmt.Sprintf
and strconv.ParseFloat
are functions that we’re calling in the process.
Exploring the graph
While topN
is great, and to be honest, in most cases it will be enough to draw conclusion from, the pprof
tool holds some more powerful diagnostic tools. To better understand the call stack, we can look at an interactive graph. There are a few ways for generating this graph - my personal favorite is to spin up a server that bundles all of the pprof
commends in a beautiful GUI in the browser.
You might need to install graphviz
go tool pprof -http localhost:6060 bin cpu.prof
Okay, that’s a little bit hard to follow, we can use the search bar at the top of this page to narrow down the graph.
Let’s search for the main.parseData
and see why it takes \(7.51\) seconds
It looks like main.parseData
is responsible for most of the CPU consumption. We’ll keep that in mind when we will optimize the code.
Analyzing the memory profile
Once again, topN
is your go to command, only in this case, we’re looking at a memory snapshot of the moment when we called pprof.WriteHeapProfile(memf)
go tool pprof bin mem.prof
File: bin
Type: inuse_space
Time: Apr 7, 2024 at 8:08pm (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 1741.73MB, 99.91% of 1743.23MB total
Dropped 1 node (cum <= 8.72MB)
flat flat% sum% cum cum%
1194.12MB 68.50% 68.50% 1194.12MB 68.50% main.parseData
547.61MB 31.41% 99.91% 1743.23MB 100% main.main
0 0% 99.91% 1743.23MB 100% runtime.main
Wow! The same main.parseData
is taking 1.2Gb of memory! It also seems like main.main
itself, is taking about 0.5Gb. So our CPU issues are being overshadowed by our much bigger memory issues. In fact, the CPU issues are likely connected to the large size of main.parseData
, since lots of memory can result in lots of memmove
(moving memory), malloc
(memory allocations) and mallocgc
(garbage collection) operations.
Exploring list
(source code analysis)
The pprof
tool provides such a deep analysis that we can even see the source code of the functions that are consuming the most memory. Let’s see the source code of main.parseData
using the list
command:
We can even go as deep as the underlying assembly code, but that’s a bit out of scope for this article.
(pprof) list parseData
Total: 1.70GB
ROUTINE ======================== main.parseData in /home/nyadgar/temp_analysis/main.go
1.17GB 1.17GB (flat, cum) 68.50% of Total
. . 68:func parseData(input map[string][]byte) map[string][]record {
. . 69: m := make(map[string][]record)
. . 70: for filename, v := range input {
. . 71: lines := strings.Split(string(v), "\n")
. . 72: for _, line := range lines {
. . 73: seg := strings.Fields(line)
. . 74: if len(seg) != 14 {
. . 75: continue
. . 76: }
. . 77: lon, _ := strconv.ParseFloat(seg[0], 64)
. . 78: lat, _ := strconv.ParseFloat(seg[1], 64)
. . 79: temps := [12]float64{}
. . 80: for i := 2; i < 14; i++ {
. . 81: t, _ := strconv.ParseFloat(seg[i], 64)
. . 82: temps[i-2] = t
. . 83: }
. . 84:
1.17GB 1.17GB 85: m[filename] = append(m[filename], record{lon, lat, temps})
. . 86: }
. . 87: }
. . 88:
. . 89: return m
. . 90:}
Well, that’s not surprising. We’re reading the entire dataset into memory and parsing it to a map, This is a good starting point for optimization.
Let’s also look at the main.main
function:
(pprof) list main.main
Total: 1.70GB
ROUTINE ======================== main.main in /home/nyadgar/temp_analysis/main.go
547.61MB 1.70GB (flat, cum) 100% of Total
. . 24:func main() {
. . 25: flag.Parse()
. . 26: if *cpuprofile != "" {
. . 27: cpuf, _ := os.Create(*cpuprofile)
. . 28: pprof.StartCPUProfile(cpuf)
. . 29: defer cpuf.Close()
. . 30: defer pprof.StopCPUProfile()
. . 31: }
. . 32:
. . 33: files, _ := os.ReadDir("data")
. . 34:
. . 35: rawData := make(map[string][]byte)
. . 36: for _, file := range files {
. . 37: f, _ := os.Open("data/" + file.Name())
. . 38: rawData[file.Name()], _ = io.ReadAll(f)
. . 39: f.Close()
. . 40: }
. . 41:
. 1.17GB 42: parsed := parseData(rawData)
. . 43: output := make(map[string][]float64)
. . 44: for _, v := range parsed {
. . 45: for _, r := range v {
. 1.50MB 46: key := fmt.Sprintf("%.3f:%.3f", r.lat, r.lon)
547.61MB 547.61MB 47: output[key] = append(output[key], r.temps[:]...)
We’re committing a double crime here, by reading the entire dataset into memory and parsing it to a map, and then, we’re creating another new map with pretty much the same data, organized a bit differently.
Optimizing the program
After analyzing the results, we can draw some conclusions:
- We should start by optimizing the
main.main
andmain.parseData
functions. Instead of reading the entire dataset into memory, we can read it line by line and parse it on the fly. This way, we can save a lot of memory and CPU time. - The
struct record
can be made simpler, by having its coordinates as a single string, and the temperatures as a singlefloat32
average. - We can remove unnecessary duplication of data by skipping some unnecessary steps in the program.
- We should search for better alternatives for
strconv.ParseFloat
and (if possible)strings.Fields
functions. - There are two
strconv.ParseFloat
calls we can remove frommain.parseData
(can you spot them?). - There are places where we can avoid
fmt.Sprintf
, which is a relatively expensive function (according to the profiling results).
Let’s look at the optimized code:
1package main
2
3import (
4 "bufio"
5 "encoding/csv"
6 "flag"
7 "fmt"
8 "os"
9 "runtime/pprof"
10 "strings"
11
12 "github.com/sugawarayuuta/refloat"
13)
14
15type record struct {
16 coords string
17 avg float32
18}
19
20var (
21 cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")
22 memprofile = flag.String("memprofile", "", "write memory profile to this file")
23)
24
25func main() {
26 flag.Parse()
27 if *cpuprofile != "" {
28 cpuf, _ := os.Create(*cpuprofile)
29 pprof.StartCPUProfile(cpuf)
30 defer cpuf.Close()
31 defer pprof.StopCPUProfile()
32 }
33
34 files, _ := os.ReadDir("data")
35
36 aggregated := make(map[string][]float32)
37 for _, file := range files {
38 f, _ := os.Open("data/" + file.Name())
39 scanner := bufio.NewScanner(f)
40 for scanner.Scan() { // reading line by line
41 if rec := parseData(scanner.Text()); rec.coords != "" {
42 aggregated[rec.coords] = append(aggregated[rec.coords], rec.avg)
43 }
44 }
45
46 f.Close()
47 }
48
49 outputFile, _ := os.Create("output.csv")
50 w := csv.NewWriter(outputFile)
51 defer w.Flush()
52 defer outputFile.Close()
53
54 if *memprofile != "" {
55 memf, _ := os.Create(*memprofile)
56 pprof.WriteHeapProfile(memf)
57 memf.Close()
58 }
59
60 for coordinates, yearlyTemps := range aggregated {
61 row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
62 w.Write(row)
63 }
64}
65
66func parseData(line string) record {
67 seg := strings.Fields(line) // no alternative found for strings.Fields
68 if len(seg) != 14 {
69 return record{}
70 }
71 var temp float32 // we know there are 12 months, we don't need to store all of them as a slice for avg
72 for i := 2; i < 14; i++ {
73 t, _ := refloat.ParseFloat(seg[i], 32) // using refloat instead of strconv.ParseFloat
74 temp += float32(t)
75 }
76
77 return record{coords: seg[1] + ":" + seg[0], avg: temp / 12} // much simpler record
78}
79
80func average(input []float32) float32 {
81 var sum float32
82 for _, v := range input {
83 sum += v
84 }
85 return sum / float32(len(input))
86}
If we run this optimized version, we should see a significant improvement in both CPU and memory usage.
go build -o bin main.go
./bin -memprofile mem.prof
go tool pprof bin mem.prof
File: bin
Type: inuse_space
Time: Apr 7, 2024 at 8:55pm (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 57.10MB, 100% of 57.10MB total
flat flat% sum% cum cum%
55.60MB 97.37% 97.37% 57.10MB 100% main.main
0.50MB 0.88% 98.25% 0.50MB 0.88% strings.Fields
0.50MB 0.88% 99.12% 0.50MB 0.88% bufio.(*Scanner).Text (inline)
0.50MB 0.88% 100% 1MB 1.75% main.parseData
0 0% 100% 57.10MB 100% runtime.main
Amazing! We’ve managed to reduce the memory consumption from 1.7Gb to 57.1Mb.
Let’s also check the CPU profile:
(pprof) top1 -cum
Showing nodes accounting for 140ms, 2.05% of 6830ms total
Dropped 103 nodes (cum <= 34.15ms)
Showing top 1 nodes out of 67
flat flat% sum% cum cum%
140ms 2.05% 2.05% 5970ms 87.41% main.main
Looks great, the CPU consumption is now down to around 7 seconds from 14 seconds!
Profiling a running server
Up until now, we’ve been profiling a single short-living program. But what if we want to profile an application such as a server? It would be a bit more complicated to pick points for profiling and trigger the profiling process. Also, we might be running the application in an environment where we don’t have access to files that are being created by the profiling process.
Using net/http/pprof
The built-in net/http/pprof
package provides an HTTP interface for profiling Go applications. It uses the same pprof
tool we’ve been using so far, but instead of writing the results to a file, it serves them over HTTP by automatically registering handlers to a HTTP server. All you need to do is to import the package and spin-up a server.
1import _ "net/http/pprof"
2
3func main() {
4 go func() {
5 log.Println(http.ListenAndServe("localhost:6060", nil))
6 }()
7
8 // your application code ...
9}
To profile the heap and CPU usage, we can use the following endpoints:
http://localhost:6060/debug/pprof/heap
http://localhost:6060/debug/pprof/profile?seconds=15
In terms of profiling, these endpoints are equivalent to the WriteHeapProfile
and StartCPUProfile
functions we’ve been using so far. Of course we can’t precisely orchestrate the profiling process as we did before, but rather trying to catch a moment where the server is under heavy load.
It worth to mention that there are other types of profiling that are more suited to distributed, server-like apps. By using the net/http/pprof
package we can easily access:
http://localhost:6060/debug/pprof/block
- goroutine blocking profile, to show where goroutines are blocking. (You needruntime.SetBlockProfileRate
to enable it)http://localhost:6060/debug/pprof/mutex
- reports locks (You needruntime.SetMutexProfileFraction
to enable it)http://localhost:6060/debug/pprof/trace?secods=10
- provides minimal tracing profile which can be accessed usinggo tool tracer [tracer file]
To see the full list and get familiar with the whole Go diagnostics ecosystem, you can read the go article about diagnostics .
Summary
One of the hardest tasks in software engineering, is detecting non-obvious, hidden issues. While you will find an error log in a place where you expect a potential error to occur, or look at a histogram of a memory-leaking process, a good profile will top that by being able to show you the exact lines of code, where your program is misbehaving. Latched on our diagnostics tool-belt are debugging, logging, tracing, and metrics. It’s time for you give profiling an honorable place in this tool-belt as well.
Oh, by the way. Here the coldest place on Earth, according to the dataset: \((-79.750,43.750)\) With average of \(-57.55\degree C\) (somewhere in the middle of Antarctica).