Linux System Monitoring Tools

Disk I/O

Please see this blog.

  • iostat : collects disk statistics, waits for the given interval, collects them again and reports the difference (a minimal sketch of this approach follows the list)
  • iotop : a top-like utility for displaying real-time disk activity
  • dstat : a versatile resource-statistics tool (a replacement for vmstat, iostat and ifstat); with its plugins it can also show which processes are responsible for system-level changes, such as doing the most I/O
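
For intuition, here is a minimal Python sketch of the collect-wait-collect-diff approach iostat takes, reading /proc/diskstats directly. The one-second interval and the two fields reported are choices made for illustration; it is not a replacement for iostat.

    #!/usr/bin/env python3
    """Minimal sketch of what iostat does under the hood: sample
    /proc/diskstats twice and report the difference (Linux only)."""
    import time

    INTERVAL = 1.0      # seconds between the two samples (illustrative choice)
    SECTOR_SIZE = 512   # /proc/diskstats counts 512-byte sectors

    def read_diskstats():
        """Return {device: (sectors_read, sectors_written)}."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # field 5 = sectors read, field 9 = sectors written
                stats[fields[2]] = (int(fields[5]), int(fields[9]))
        return stats

    before = read_diskstats()
    time.sleep(INTERVAL)
    after = read_diskstats()

    for dev, (rd1, wr1) in after.items():
        rd0, wr0 = before.get(dev, (rd1, wr1))
        read_kb = (rd1 - rd0) * SECTOR_SIZE / 1024 / INTERVAL
        write_kb = (wr1 - wr0) * SECTOR_SIZE / 1024 / INTERVAL
        if read_kb or write_kb:
            print(f"{dev:10s} read {read_kb:8.1f} kB/s  write {write_kb:8.1f} kB/s")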

Network

iperf

iperf uses a client-server architecture. To run it, you need to install it on both nodes, then run it in server mode on one end and in client mode on the other. You may see this post for more info.
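
Below is a minimal Python sketch of that workflow, driving iperf3 through subprocess. It assumes iperf3 is installed on both machines; the server address is a placeholder and the port is iperf3's default.

    #!/usr/bin/env python3
    """Sketch of the iperf client/server workflow.
    Assumes iperf3 is installed on both nodes; SERVER_HOST is a placeholder."""
    import subprocess
    import sys

    SERVER_HOST = "192.168.1.10"  # hypothetical address of the server node
    PORT = 5201                   # iperf3's default port
    DURATION = 10                 # seconds to run the test

    if sys.argv[1:] == ["server"]:
        # On the server node: listen for incoming measurements.
        subprocess.run(["iperf3", "-s", "-p", str(PORT)], check=True)
    else:
        # On the client node: connect to the server and measure throughput.
        subprocess.run(
            ["iperf3", "-c", SERVER_HOST, "-p", str(PORT), "-t", str(DURATION)],
            check=True,
        )

Run it with the argument server on one node and with no argument on the other.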

Memory

Monitor available memory

Please refer to my previous post on free.
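
As a quick illustration of where free gets its numbers, the sketch below reads /proc/meminfo directly; it assumes a Linux kernel recent enough to expose the MemAvailable field (what free shows in its "available" column).

    #!/usr/bin/env python3
    """Minimal sketch of monitoring available memory via /proc/meminfo
    (assumes a kernel that exposes MemAvailable)."""

    def meminfo():
        """Return /proc/meminfo as a {key: value-in-kB} dict."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                info[key] = int(value.split()[0])  # values are reported in kB
        return info

    m = meminfo()
    print(f"total:     {m['MemTotal'] / 1024:10.0f} MB")
    print(f"free:      {m['MemFree'] / 1024:10.0f} MB")
    print(f"available: {m['MemAvailable'] / 1024:10.0f} MB")  # free's "available" column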

CPU

  • top : display and update sorted information about processes (a minimal sketch of this kind of view follows the list)
  • htop : interactive process viewer (a prettier version of top)
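
Below is a minimal Python sketch of the sorted per-process CPU view that top and htop provide. It assumes the third-party psutil package is installed (pip install psutil); the one-second interval and top-10 cut-off are choices made for illustration.

    #!/usr/bin/env python3
    """Minimal sketch of a top-like sorted process view.
    Assumes psutil is installed (pip install psutil)."""
    import time
    import psutil

    # Prime the per-process CPU counters; the first call always returns 0.0.
    for p in psutil.process_iter():
        try:
            p.cpu_percent(None)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

    time.sleep(1.0)  # measurement interval

    procs = []
    for p in psutil.process_iter(["pid", "name"]):
        try:
            procs.append((p.cpu_percent(None), p.info["pid"], p.info["name"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

    print(f"{'CPU%':>6}  {'PID':>7}  NAME")
    for cpu, pid, name in sorted(procs, reverse=True)[:10]:
        print(f"{cpu:6.1f}  {pid:7d}  {name}")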

Nvidia GPU

Use nvidia-smi to monitor GPU usage. However, there are some caveats.

GPU Utilization

It is worth noting that GPU utilization as reported by nvidia-smi is not a reflection of how heavily the GPU's compute resources are being used. It is the percentage of time over the past sample period during which one or more kernels was executing on the GPU.

If the GPU is idle, this percentage will be near 0%. If any kernel is running during the sample period, the counter moves toward 100%, no matter how few SMs that kernel occupies or how little useful work it does. For example, a kernel that spins on a single SM doing nothing reports the same 100% utilization as a kernel that saturates every SM. In other words, the metric tells you whether the GPU is executing something, not how efficiently it is being used.

GPU Memory Utilization

The memory utilization counter reported by nvidia-smi is the percent of time over the past sample period (usually 1 second) during which global (device) memory was being read or written.

You can keep both utilization counters near 100% by simply running a kernel on a single SM and transferring 1 byte over PCIe back and forth. Utilization is not a “how well you’re using the resources” statistic but an “are you using the resources at all” statistic. To get SM-level information, consider using NVIDIA's profilers (e.g. Nsight Compute).
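
If you want to read these two counters programmatically, the sketch below queries them through NVML, which exposes them in the nvmlUtilization_t struct cited in the references. It assumes the nvidia-ml-py package (imported as pynvml) is installed and an NVIDIA driver is present; the five-sample, one-second polling loop is just for illustration.

    #!/usr/bin/env python3
    """Minimal sketch of polling the utilization counters nvidia-smi reports.
    Assumes nvidia-ml-py (pynvml) is installed and a driver is available."""
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    try:
        for _ in range(5):
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            # util.gpu:    % of the sample period in which any kernel was executing
            # util.memory: % of the sample period in which device memory was read or written
            # Neither counter says how many SMs were busy or how useful the work was.
            print(f"gpu: {util.gpu:3d}%  memory: {util.memory:3d}%")
            time.sleep(1.0)
    finally:
        pynvml.nvmlShutdown()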


References

  1. Nvidia-smi: https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t