<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Monitoring | Ziji's Homepage</title><link>https://zijishi.xyz/tag/monitoring/</link><atom:link href="https://zijishi.xyz/tag/monitoring/index.xml" rel="self" type="application/rss+xml"/><description>Monitoring</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><copyright>Ziji Shi © 2025</copyright><lastBuildDate>Tue, 27 Jun 2023 02:25:27 +0800</lastBuildDate><image><url>https://zijishi.xyz/media/icon_hu_926934747de47144.png</url><title>Monitoring</title><link>https://zijishi.xyz/tag/monitoring/</link></image><item><title>Linux System Monitoring Tools</title><link>https://zijishi.xyz/post/linux/linux-ssytem-monitoring-tools/</link><pubDate>Tue, 27 Jun 2023 02:25:27 +0800</pubDate><guid>https://zijishi.xyz/post/linux/linux-ssytem-monitoring-tools/</guid><description>&lt;p&gt;A short cheat sheet of the tools I reach for when a Linux box misbehaves — grouped by what resource you&amp;rsquo;re investigating. One thing worth calling out upfront: &lt;code&gt;nvidia-smi&lt;/code&gt;&amp;rsquo;s GPU utilization number doesn&amp;rsquo;t mean what most people think it means. More on that below.&lt;/p&gt;
&lt;h2 id="disk-io"&gt;Disk I/O&lt;/h2&gt;
&lt;p&gt;Please see &lt;a href="https://www.opsdash.com/blog/disk-monitoring-linux.html" target="_blank" rel="noopener"&gt;this blog&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iostat&lt;/code&gt; : collects disk statistics, waits for the given interval, collects them again, and displays the difference&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iotop&lt;/code&gt; : a top-like utility that displays real-time disk activity per process&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dstat&lt;/code&gt; : a versatile resource-statistics tool (roughly vmstat, iostat, and ifstat combined); its &lt;code&gt;--top-io&lt;/code&gt; plugin shows which processes are generating the most I/O&lt;/li&gt;
&lt;/ul&gt;
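&lt;p&gt;A minimal sketch of how I typically invoke these (assuming the sysstat and iotop packages are installed; the interval and count values are just examples):&lt;/p&gt;

```shell
# Per-device extended stats: 2 samples, 1 second apart. The first sample
# is the average since boot; the second shows the delta for the interval.
if command -v iostat >/dev/null; then
  iostat -dx 1 2
fi

# iotop needs root and a terminal, so shown here as comments:
#   sudo iotop -o          # only list processes currently doing I/O
#   sudo iotop -b -n 5 -o  # batch mode, 5 iterations (scriptable)

# dstat rolls vmstat/iostat/ifstat-style columns into one view:
#   dstat --top-io         # flag the process doing the most I/O per tick

# All of these ultimately read the kernel's per-device counters:
sample=$(head -n 1 /proc/diskstats 2>/dev/null)
sample=${sample:-unavailable}
echo "$sample"
```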
&lt;h2 id="network"&gt;Network&lt;/h2&gt;
&lt;h3 id="iperf"&gt;&lt;code&gt;iperf&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;iperf&lt;/code&gt; uses a client-server architecture: install it on both nodes, then run it in server mode on one end and in client mode on the other. See &lt;a href="https://zijishi.xyz/post/linux/iperf-tuning-guide/" target="_blank" rel="noopener"&gt;this post&lt;/a&gt; for more details.&lt;/p&gt;
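&lt;p&gt;A hedged sketch of a typical run (the server host name is a placeholder; flags are for iperf3, whose defaults differ slightly from classic iperf2):&lt;/p&gt;

```shell
PORT=5201   # iperf3's default port (classic iperf2 used 5001)

# On the server node:  iperf3 -s -p "$PORT"
# On the client node:  iperf3 -c server.example.com -p "$PORT" -t 30 -P 4
#   -t 30  run for 30 seconds      -P 4  use 4 parallel streams

# Loopback self-test on a single machine, as a sanity check only:
if command -v iperf3 >/dev/null; then
  iperf3 -s -D -1 -p "$PORT" || true   # daemonize, serve one client, exit
  sleep 1
  iperf3 -c 127.0.0.1 -p "$PORT" -t 2 || true
fi
```

&lt;p&gt;The loopback number exercises memory bandwidth and CPU rather than the network, so treat it purely as a smoke test.&lt;/p&gt;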
&lt;h2 id="memory"&gt;Memory&lt;/h2&gt;
&lt;h3 id="monitor-available-memory"&gt;Monitor available memory&lt;/h3&gt;
&lt;p&gt;Please refer to my previous post on &lt;a href="https://zijishi.xyz/post/linux/monitor-memory-usage-with-free/" target="_blank" rel="noopener"&gt;&lt;code&gt;free&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
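&lt;p&gt;As a quick sketch, the available figure that &lt;code&gt;free&lt;/code&gt; prints comes from the kernel's MemAvailable estimate, which you can also read directly:&lt;/p&gt;

```shell
# MemAvailable estimates how much memory new workloads can claim
# without pushing the system into swap.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
avail_kb=${avail_kb:-0}   # fall back to 0 on non-Linux systems
echo "available: $((avail_kb / 1024)) MiB"

# free itself, human-readable and refreshing every 2 seconds:
#   free -h -s 2
```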
&lt;h2 id="cpu"&gt;CPU&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt; : display and update sorted information about processes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;htop&lt;/code&gt; : interactive process viewer (aka prettier version of &lt;code&gt;top&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
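&lt;p&gt;For scripting, &lt;code&gt;top&lt;/code&gt; has a batch mode, and the load averages are available on their own (a rough sketch):&lt;/p&gt;

```shell
# One non-interactive snapshot; the header lines summarize load,
# task counts, CPU, and memory.
if command -v top >/dev/null; then
  top -b -n 1 | head -n 5
fi

# Just the 1-minute load average (run-queue length, from /proc/loadavg):
load1=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null)
load1=${load1:-0}
echo "1-minute load average: $load1"
```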
&lt;h2 id="nvidia-gpu"&gt;Nvidia GPU&lt;/h2&gt;
&lt;p&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt; to monitor GPU usage. However, there are some caveats.&lt;/p&gt;
&lt;h3 id="gpu-utilization"&gt;GPU Utilization&lt;/h3&gt;
&lt;p&gt;It is worth noting that GPU utilization as reported by &lt;code&gt;nvidia-smi&lt;/code&gt; does not reflect how busy the GPU&amp;rsquo;s compute units are. It is the percentage of time over the past sample period during which at least one kernel was executing on the GPU, regardless of how many SMs that kernel actually occupied.&lt;/p&gt;
&lt;p&gt;If the GPU is idle, this percentage will be near 0%. But a high reading does not imply useful work: a kernel that spins in a loop doing nothing, or one that occupies a single SM while the rest sit idle, still reports 100% utilization, exactly like a well-optimized kernel that saturates every SM. In other words, the metric tells you whether the GPU is running kernels at all, not how efficiently it is running them.&lt;/p&gt;
&lt;h3 id="gpu-memory-usage"&gt;GPU Memory Usage&lt;/h3&gt;
&lt;p&gt;Percent of time over the past sample period (usually 1 second) during which global (device) memory was being read or written.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can keep the utilization counters near 100% by simply running a kernel on a single SM and shuttling 1 byte back and forth over PCIe. Utilization is not a &amp;ldquo;how well are you using the resources&amp;rdquo; statistic but an &amp;ldquo;are you using the resources at all&amp;rdquo; one. To get SM-level information, consider using NVIDIA&amp;rsquo;s profilers.&lt;/p&gt;&lt;/blockquote&gt;
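&lt;p&gt;Both counters can be queried in machine-readable form; note the contrast between &lt;code&gt;utilization.memory&lt;/code&gt; (a duty cycle) and &lt;code&gt;memory.used&lt;/code&gt; (actual occupancy). The binary name in the profiler example is a placeholder:&lt;/p&gt;

```shell
if command -v nvidia-smi >/dev/null; then
  # One CSV line per GPU: kernel duty cycle, memory-bus duty cycle,
  # and how much device memory is actually allocated.
  nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv,noheader
fi

# SM-level metrics need a profiler, e.g. Nsight Compute:
#   ncu ./my_app    # my_app is a placeholder for your binary
NOTE="utilization.memory is a duty cycle; memory.used is occupancy"
echo "$NOTE"
```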
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Nvidia-smi: &lt;a href="https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t" target="_blank" rel="noopener"&gt;https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</description></item></channel></rss>