
Profiling in Go: A Practical Guide

By Noam Yadgar
#go #runtime #cpu #memory #pprof #diagnostics #software-engineering #optimization #profile

Profiling is an underrated skill among software engineers, and it’s often overlooked even by very skilled developers. Profiling a program essentially means measuring its CPU and memory usage to spot bottlenecks, memory leaks, and other performance issues. Knowing how to profile a program and interpret the results can show us exactly where our program suffers from poor performance, so we can focus our optimization efforts on those specific parts. In this article, we’ll profile a real-world Go program, learn how to interpret the results, draw conclusions, and optimize the program accordingly.

The program we’ll be profiling

What’s the hottest place on Earth? What’s the coldest? Our program interacts with the Global Temperature dataset, which contains a hundred years of monthly average temperatures for about 85K coordinates around the globe. In terms of records, that’s \(85,000 \times 12 \times 100 = 102,000,000\) records (about 1.1 GB). Our program reads this dataset and outputs a table of all coordinates with their total average temperature over the years.

Let’s look at the data first:

ls
air_temp.1900  air_temp.1911  air_temp.1922  air_temp.1933  air_temp.1944  air_temp.1955  air_temp.1966  air_temp.1977
air_temp.1901  air_temp.1912  air_temp.1923  air_temp.1934  air_temp.1945  air_temp.1956  air_temp.1967  air_temp.1978
air_temp.1902  air_temp.1913  air_temp.1924  air_temp.1935  air_temp.1946  air_temp.1957  air_temp.1968  air_temp.1979
air_temp.1903  air_temp.1914  air_temp.1925  air_temp.1936  air_temp.1947  air_temp.1958  air_temp.1969  air_temp.1980 ...

Each file contains that year’s monthly average temperatures per coordinate.

The content of each file looks like this (here are the first 5 lines of one year):

head -n 5 air_temp.1990
-179.750  71.250   -24.7   -29.9   -24.4   -12.5    -2.6     0.4     2.2     4.2    -0.9    -5.9   -15.4   -25.5
-179.750  68.750   -29.9   -32.5   -22.5   -10.9    -3.7     0.7     2.8     3.6    -1.1    -8.7   -17.1   -27.5
-179.750  68.250   -31.0   -33.7   -23.6   -11.9    -4.5     0.9     3.3     3.6    -1.7    -9.7   -18.3   -28.2
-179.750  67.750   -30.1   -33.9   -23.3   -11.5    -3.2     3.6     6.8     6.0    -0.2    -9.0   -18.6   -27.0
-179.750  67.250   -32.0   -37.9   -26.8   -14.5    -5.2     3.6     7.2     5.7    -1.9   -11.7   -22.0   -29.4

The first two columns are the coordinates, and the rest are the monthly average temperatures.

Here is our full program:

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strconv"
	"strings"
)

type record struct {
	lon, lat float64
	temps    [12]float64
}

func main() {
	// Read every file in data/ into memory, keyed by filename.
	files, _ := os.ReadDir("data")

	rawData := make(map[string][]byte)
	for _, file := range files {
		f, _ := os.Open("data/" + file.Name())
		rawData[file.Name()], _ = io.ReadAll(f)
		f.Close()
	}

	// Re-key the parsed records as "lat:lon", collecting every monthly
	// temperature for each coordinate across the years.
	parsed := parseData(rawData)
	output := make(map[string][]float64)
	for _, v := range parsed {
		for _, r := range v {
			key := fmt.Sprintf("%.3f:%.3f", r.lat, r.lon)
			output[key] = append(output[key], r.temps[:]...)
		}
	}

	outputFile, _ := os.Create("output.csv")
	w := csv.NewWriter(outputFile)
	defer w.Flush()
	defer outputFile.Close()

	for coordinates, yearlyTemps := range output {
		row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
		w.Write(row)
	}
}

func parseData(input map[string][]byte) map[string][]record {
	m := make(map[string][]record)
	for filename, v := range input {
		lines := strings.Split(string(v), "\n")
		for _, line := range lines {
			seg := strings.Fields(line)
			if len(seg) != 14 { // 2 coordinates + 12 monthly temperatures
				continue
			}
			lon, _ := strconv.ParseFloat(seg[0], 64)
			lat, _ := strconv.ParseFloat(seg[1], 64)
			temps := [12]float64{}
			for i := 2; i < 14; i++ {
				t, _ := strconv.ParseFloat(seg[i], 64)
				temps[i-2] = t
			}

			m[filename] = append(m[filename], record{lon, lat, temps})
		}
	}

	return m
}

func average(input []float64) float64 {
	var sum float64
	for _, v := range input {
		sum += v
	}
	return sum / float64(len(input))
}

If we run this program, it takes a few seconds and outputs a file named output.csv with the results.

On my machine it took about 14 seconds (CPU: Intel i7-9700 (8) @ 4.700GHz).
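For a quick end-to-end number before doing any real profiling, the shell’s time builtin is enough (the exact output format varies by shell):

go build -o bin main.go
time ./bin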

Here are the first few lines of the output file (sorted from highest to lowest):

column -t -s, output.csv | sort -k2 -nr | head -n 5
-9.750:-75.250    34.99
-10.250:-75.250   33.79
-10.250:-74.750   32.98
19.250:41.250     31.83
-7.250:31.750     31.51

Profiling with runtime/pprof

Go is equipped with an incredible profiling package called runtime/pprof. With this library, we can easily profile our program’s CPU and memory usage.

Profiling CPU usage

CPU profiling is a time-based measurement: we’d like to know how long each function takes to execute. When we perform CPU profiling with runtime/pprof, our program is briefly paused at regular intervals (100 times per second, by default) to collect samples. The more samples we collect, the more accurate the result will be.

Since our program is a single short-lived procedure with a defined end, we can profile the entire program. Here is how we can do it:

import (
	"flag"
	"os"
	"runtime/pprof"
	// ... other imports
)

var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		cpuf, _ := os.Create(*cpuprofile)
		pprof.StartCPUProfile(cpuf)
		defer cpuf.Close()
		defer pprof.StopCPUProfile()
	}
  // ...
}

When we run our program with the -cpuprofile flag, it will start CPU profiling and write the results to the specified file.

Profiling memory usage

While CPU profiling integrates samples collected over a period of time, memory profiling is a snapshot of the memory state at a specific moment. If possible, we should find a strategic point in our program to take the snapshot. In our case, it looks like we can learn a lot by placing it just before we write rows to the output file. Let’s add it as well:

// ...
var (
	cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")
	memprofile = flag.String("memprofile", "", "write memory profile to this file")
)

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		cpuf, _ := os.Create(*cpuprofile)
		pprof.StartCPUProfile(cpuf)
		defer cpuf.Close()
		defer pprof.StopCPUProfile()
	}

    // ... some code

	if *memprofile != "" {
		memf, _ := os.Create(*memprofile)
		pprof.WriteHeapProfile(memf)
		memf.Close()
	}

	for coordinates, yearlyTemps := range output {
		row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
		w.Write(row)
	}
}

Analyzing the CPU profile

We can now build our program and run it with the profiling flags:

go build -o bin main.go
./bin -cpuprofile=cpu.prof
./bin -memprofile=mem.prof

The reason I run the program twice is that I want to profile CPU and memory separately. If we run the program with both flags, the CPU profile will be affected by the memory profiling and vice versa.

By now, we should have two files: cpu.prof and mem.prof. Let’s analyze the results.

Running go tool pprof

We can launch the Go pprof tool and analyze the results, let’s start with the CPU profile:

go tool pprof bin cpu.prof
File: bin
Type: cpu
Time: Apr 7, 2024 at 5:24pm (IDT)
Duration: 14.03s, Total samples = 14.05s (100.11%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof)

We are now at the pprof CLI, and the first go-to action is to check topN.

Exploring topN

Let’s see the top 5 most CPU-expensive functions:

(pprof) top5
Showing nodes accounting for 7360ms, 52.38% of 14050ms total
Dropped 104 nodes (cum <= 70.25ms)
Showing top 5 nodes out of 72
      flat  flat%   sum%        cum   cum%
    2640ms 18.79% 18.79%     2640ms 18.79%  runtime.memmove
    1820ms 12.95% 31.74%     2560ms 18.22%  strings.Fields
    1600ms 11.39% 43.13%     1600ms 11.39%  strconv.readFloat
     800ms  5.69% 48.83%      800ms  5.69%  strconv.(*decimal).Assign
     500ms  3.56% 52.38%      500ms  3.56%  strconv.atof64exact
  • Flat - the time (or percentage) spent in the described function itself.
  • Cum (cumulative) - the time (or percentage) spent in the described function plus all the functions it calls (self included). The sketch below makes the distinction concrete.
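Here is a toy sketch (the function names and workloads are made up for illustration): in a CPU profile of this program, busy would show roughly equal flat and cum times, while work’s cum would include busy’s time but its flat would not.

package main

import "fmt"

// busy burns CPU on its own, so its flat time is roughly its cum time.
func busy(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// work does a little looping of its own (flat time) and then calls busy,
// whose time counts toward work's cum time but not its flat time.
func work(n int) int {
	total := 0
	for i := 0; i < 1_000_000; i++ {
		total += i
	}
	return total + busy(n)
}

func main() {
	fmt.Println(work(100_000_000))
}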

Looks like runtime.memmove, strings.Fields and strconv.readFloat are pretty expensive.

To get a better understanding, we can sort top5 by cum and see who’s calling those functions:

(pprof) top5 -cum
Showing nodes accounting for 0.47s, 3.35% of 14.05s total
Dropped 104 nodes (cum <= 0.07s)
Showing top 5 nodes out of 72
      flat  flat%   sum%        cum   cum%
     0.22s  1.57%  1.57%     13.81s 98.29%  main.main
         0     0%  1.57%     13.81s 98.29%  runtime.main
     0.11s  0.78%  2.35%      7.51s 53.45%  main.parseData
     0.02s  0.14%  2.49%      3.28s 23.35%  fmt.Sprintf
     0.12s  0.85%  3.35%      3.17s 22.56%  strconv.ParseFloat

Notice that runtime.main takes 98.29% of cumulative time but 0% flat. That makes sense: everything is called, directly or indirectly, by runtime.main, yet the function itself barely uses the CPU. We can also see functions from our own code: main.parseData accounts for 7.51 seconds of the entire process, and fmt.Sprintf and strconv.ParseFloat are functions we call along the way.

Exploring the graph

While topN is great (and, to be honest, in most cases it will be enough to draw conclusions from), the pprof tool holds some more powerful diagnostic features. To better understand the call stack, we can look at an interactive graph. There are a few ways to generate this graph; my personal favorite is to spin up a server that bundles all of the pprof commands in a beautiful GUI in the browser.

You might need to install graphviz

go tool pprof -http localhost:6060 bin cpu.prof

[Image: CPU profile graph]

Okay, that’s a little hard to follow. We can use the search bar at the top of the page to narrow down the graph.

Let’s search for main.parseData and see why it takes \(7.51\) seconds:

[Image: CPU profile graph, narrowed to main.parseData]

It looks like main.parseData is responsible for most of the CPU consumption. We’ll keep that in mind when we optimize the code.

Analyzing the memory profile

Once again, topN is your go-to command. Only this time, we’re looking at a snapshot of the memory at the moment we called pprof.WriteHeapProfile(memf):

go tool pprof bin mem.prof
File: bin
Type: inuse_space
Time: Apr 7, 2024 at 8:08pm (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 1741.73MB, 99.91% of 1743.23MB total
Dropped 1 node (cum <= 8.72MB)
      flat  flat%   sum%        cum   cum%
 1194.12MB 68.50% 68.50%  1194.12MB 68.50%  main.parseData
  547.61MB 31.41% 99.91%  1743.23MB   100%  main.main
         0     0% 99.91%  1743.23MB   100%  runtime.main

Wow! The same main.parseData is taking 1.2 GB of memory! It also seems like main.main itself is holding about 0.5 GB. So our CPU issues are overshadowed by much bigger memory issues. In fact, the CPU issues are likely connected to the sheer amount of memory main.parseData allocates, since heavy memory traffic results in lots of memmove (moving memory), malloc (memory allocation) and mallocgc (garbage collection) operations.
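A side note: the heap profile’s default view (inuse_space) shows memory that is still live at the moment of the snapshot. If you also want to see everything that was ever allocated over the program’s lifetime, pprof can switch the sample index, for example:

go tool pprof -sample_index=alloc_space bin mem.prof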

Exploring list (source code analysis)

The pprof tool provides such a deep analysis that we can even see the source code of the functions that are consuming the most memory. Let’s see the source code of main.parseData using the list command:

We can even go as deep as the underlying assembly code, but that’s a bit out of scope for this article.
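For example, pprof’s weblist and disasm commands annotate a function’s source together with its assembly; shown here only as a pointer, we won’t dig into them:

(pprof) weblist parseData
(pprof) disasm parseData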

(pprof) list parseData
Total: 1.70GB
ROUTINE ======================== main.parseData in /home/nyadgar/temp_analysis/main.go
    1.17GB     1.17GB (flat, cum) 68.50% of Total
         .          .     68:func parseData(input map[string][]byte) map[string][]record {
         .          .     69:	m := make(map[string][]record)
         .          .     70:	for filename, v := range input {
         .          .     71:		lines := strings.Split(string(v), "\n")
         .          .     72:		for _, line := range lines {
         .          .     73:			seg := strings.Fields(line)
         .          .     74:			if len(seg) != 14 {
         .          .     75:				continue
         .          .     76:			}
         .          .     77:			lon, _ := strconv.ParseFloat(seg[0], 64)
         .          .     78:			lat, _ := strconv.ParseFloat(seg[1], 64)
         .          .     79:			temps := [12]float64{}
         .          .     80:			for i := 2; i < 14; i++ {
         .          .     81:				t, _ := strconv.ParseFloat(seg[i], 64)
         .          .     82:				temps[i-2] = t
         .          .     83:			}
         .          .     84:
    1.17GB     1.17GB     85:			m[filename] = append(m[filename], record{lon, lat, temps})
         .          .     86:		}
         .          .     87:	}
         .          .     88:
         .          .     89:	return m
         .          .     90:}

Well, that’s not surprising. We’re reading the entire dataset into memory and parsing it into a map. This is a good starting point for optimization.

Let’s also look at the main.main function:

(pprof) list main.main
Total: 1.70GB
ROUTINE ======================== main.main in /home/nyadgar/temp_analysis/main.go
  547.61MB     1.70GB (flat, cum)   100% of Total
         .          .     24:func main() {
         .          .     25:	flag.Parse()
         .          .     26:	if *cpuprofile != "" {
         .          .     27:		cpuf, _ := os.Create(*cpuprofile)
         .          .     28:		pprof.StartCPUProfile(cpuf)
         .          .     29:		defer cpuf.Close()
         .          .     30:		defer pprof.StopCPUProfile()
         .          .     31:	}
         .          .     32:
         .          .     33:	files, _ := os.ReadDir("data")
         .          .     34:
         .          .     35:	rawData := make(map[string][]byte)
         .          .     36:	for _, file := range files {
         .          .     37:		f, _ := os.Open("data/" + file.Name())
         .          .     38:		rawData[file.Name()], _ = io.ReadAll(f)
         .          .     39:		f.Close()
         .          .     40:	}
         .          .     41:
         .     1.17GB     42:	parsed := parseData(rawData)
         .          .     43:	output := make(map[string][]float64)
         .          .     44:	for _, v := range parsed {
         .          .     45:		for _, r := range v {
         .     1.50MB     46:			key := fmt.Sprintf("%.3f:%.3f", r.lat, r.lon)
  547.61MB   547.61MB     47:			output[key] = append(output[key], r.temps[:]...)

We’re committing a double crime here: first we read the entire dataset into memory and parse it into a map, and then we create another map with pretty much the same data, organized just a bit differently.

Optimizing the program

After analyzing the results, we can draw some conclusions:

  • We should start by optimizing the main.main and main.parseData functions. Instead of reading the entire dataset into memory, we can read it line by line and parse it on the fly. This way, we can save a lot of memory and CPU time.
  • The struct record can be made simpler, by having its coordinates as a single string, and the temperatures as a single float32 average.
  • We can remove duplication of data by skipping unnecessary steps in the program.
  • We should search for better alternatives for strconv.ParseFloat and (if possible) strings.Fields functions.
  • There are two strconv.ParseFloat calls we can remove from main.parseData (can you spot them?).
  • There are places where we can avoid fmt.Sprintf, which is a relatively expensive function (according to the profiling results).

Let’s look at the optimized code:

package main

import (
	"bufio"
	"encoding/csv"
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
	"strings"

	"github.com/sugawarayuuta/refloat"
)

type record struct {
	coords string
	avg    float32
}

var (
	cpuprofile = flag.String("cpuprofile", "", "write cpu profile to this file")
	memprofile = flag.String("memprofile", "", "write memory profile to this file")
)

func main() {
	flag.Parse()
	if *cpuprofile != "" {
		cpuf, _ := os.Create(*cpuprofile)
		pprof.StartCPUProfile(cpuf)
		defer cpuf.Close()
		defer pprof.StopCPUProfile()
	}

	files, _ := os.ReadDir("data")

	aggregated := make(map[string][]float32)
	for _, file := range files {
		f, _ := os.Open("data/" + file.Name())
		scanner := bufio.NewScanner(f)
		for scanner.Scan() { // reading line by line
			if rec := parseData(scanner.Text()); rec.coords != "" {
				aggregated[rec.coords] = append(aggregated[rec.coords], rec.avg)
			}
		}

		f.Close()
	}

	outputFile, _ := os.Create("output.csv")
	w := csv.NewWriter(outputFile)
	defer w.Flush()
	defer outputFile.Close()

	if *memprofile != "" {
		memf, _ := os.Create(*memprofile)
		pprof.WriteHeapProfile(memf)
		memf.Close()
	}

	for coordinates, yearlyTemps := range aggregated {
		row := []string{coordinates, fmt.Sprintf("%.2f", average(yearlyTemps))}
		w.Write(row)
	}
}

func parseData(line string) record {
	seg := strings.Fields(line) // no alternative found for strings.Fields
	if len(seg) != 14 {
		return record{}
	}
	var temp float32 // we know there are 12 months, we don't need to store all of them as a slice for avg
	for i := 2; i < 14; i++ {
		t, _ := refloat.ParseFloat(seg[i], 32) // using refloat instead of strconv.ParseFloat
		temp += float32(t)
	}

	return record{coords: seg[1] + ":" + seg[0], avg: temp / 12} // much simpler record
}

func average(input []float32) float32 {
	var sum float32
	for _, v := range input {
		sum += v
	}
	return sum / float32(len(input))
}

If we run this optimized version, we should see a significant improvement in both CPU and memory usage.

go build -o bin main.go
./bin -memprofile mem.prof
go tool pprof bin mem.prof
File: bin
Type: inuse_space
Time: Apr 7, 2024 at 8:55pm (IDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top5
Showing nodes accounting for 57.10MB, 100% of 57.10MB total
      flat  flat%   sum%        cum   cum%
   55.60MB 97.37% 97.37%    57.10MB   100%  main.main
    0.50MB  0.88% 98.25%     0.50MB  0.88%  strings.Fields
    0.50MB  0.88% 99.12%     0.50MB  0.88%  bufio.(*Scanner).Text (inline)
    0.50MB  0.88%   100%        1MB  1.75%  main.parseData
         0     0%   100%    57.10MB   100%  runtime.main

Amazing! We’ve managed to reduce the memory consumption from 1.7 GB to 57.1 MB.

Let’s also check the CPU profile:

(pprof) top1 -cum
Showing nodes accounting for 140ms, 2.05% of 6830ms total
Dropped 103 nodes (cum <= 34.15ms)
Showing top 1 nodes out of 67
      flat  flat%   sum%        cum   cum%
     140ms  2.05%  2.05%     5970ms 87.41%  main.main

Looks great! The total CPU time is now down to around 7 seconds, from 14 seconds!
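Another handy trick when validating an optimization: pprof can diff two profiles directly, so the report shows exactly what got cheaper. Assuming you kept a profile from before the change (cpu_old.prof is a made-up filename here):

go tool pprof -diff_base=cpu_old.prof cpu.prof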

Profiling a running server

Up until now, we’ve been profiling a single short-lived program. But what if we want to profile a long-running application such as a server? It would be more complicated to pick points for profiling and to trigger the profiling process. Also, we might be running the application in an environment where we have no access to files created by the profiling process.

Using net/http/pprof

The built-in net/http/pprof package provides an HTTP interface for profiling Go applications. It uses the same pprof machinery we’ve been using so far, but instead of writing the results to a file, it serves them over HTTP by automatically registering handlers on an HTTP server. All you need to do is import the package and spin up a server:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// your application code ...
}

To profile the heap and CPU usage, we can use the following endpoints:

  • http://localhost:6060/debug/pprof/heap
  • http://localhost:6060/debug/pprof/profile?seconds=15

In terms of profiling, these endpoints are equivalent to the WriteHeapProfile and StartCPUProfile functions we’ve been using so far. Of course, we can’t orchestrate the profiling process as precisely as we did before; instead, we try to catch a moment when the server is under the load we care about.
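In practice, you rarely need to download these files by hand; go tool pprof can consume the endpoints directly and drop you into the same interactive CLI we used earlier:

go tool pprof http://localhost:6060/debug/pprof/heap
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=15"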

It’s worth mentioning that there are other types of profiles, better suited to distributed, server-like apps, that the net/http/pprof package exposes just as easily:

  • http://localhost:6060/debug/pprof/block - the goroutine blocking profile, showing where goroutines block (you need runtime.SetBlockProfileRate to enable it)
  • http://localhost:6060/debug/pprof/mutex - reports lock contention (you need runtime.SetMutexProfileFraction to enable it); see the sketch below for enabling both
  • http://localhost:6060/debug/pprof/trace?seconds=10 - produces an execution trace, which can be inspected with go tool trace [trace file]
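Here is a minimal sketch of a server with both of these knobs enabled (a rate/fraction of 1 records every event, which is illustrative and usually too expensive for production):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"runtime"
)

func main() {
	runtime.SetBlockProfileRate(1)     // sample every goroutine-blocking event (illustrative)
	runtime.SetMutexProfileFraction(1) // sample every mutex contention event (illustrative)

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// your application code ...
	select {} // keep this sketch alive
}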

To see the full list and get familiar with the whole Go diagnostics ecosystem, you can read the official Go article about diagnostics.

Summary

One of the hardest tasks in software engineering is detecting non-obvious, hidden issues. An error log shows up where you already expected a potential error, and a histogram can hint at a memory-leaking process, but a good profile tops both by pointing at the exact lines of code where your program is misbehaving. Latched onto our diagnostics tool-belt are debugging, logging, tracing, and metrics. It’s time to give profiling an honorable place in this tool-belt as well.

Oh, by the way, here is the coldest place on Earth, according to the dataset: \((-79.750, 43.750)\), with an average of \(-57.55\,^{\circ}C\) (somewhere in the middle of Antarctica).