<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Monitoring | Ziji's Homepage</title><link>https://zijishi.xyz/tag/monitoring/</link><atom:link href="https://zijishi.xyz/tag/monitoring/index.xml" rel="self" type="application/rss+xml"/><description>Monitoring</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><copyright>Ziji Shi © 2025</copyright><lastBuildDate>Tue, 27 Jun 2023 02:25:27 +0800</lastBuildDate><image><url>https://zijishi.xyz/media/icon_hu_926934747de47144.png</url><title>Monitoring</title><link>https://zijishi.xyz/tag/monitoring/</link></image><item><title>Linux System Monitoring Tools</title><link>https://zijishi.xyz/post/linux/linux-ssytem-monitoring-tools/</link><pubDate>Tue, 27 Jun 2023 02:25:27 +0800</pubDate><guid>https://zijishi.xyz/post/linux/linux-ssytem-monitoring-tools/</guid><description>&lt;p&gt;A short cheat sheet of the tools I reach for when a Linux box misbehaves — grouped by what resource you&amp;rsquo;re investigating. One thing worth calling out upfront: &lt;code&gt;nvidia-smi&lt;/code&gt;&amp;rsquo;s GPU utilization number doesn&amp;rsquo;t mean what most people think it means. More on that below.&lt;/p&gt;
&lt;h2 id="disk-io"&gt;Disk I/O&lt;/h2&gt;
&lt;p&gt;Please see &lt;a href="https://www.opsdash.com/blog/disk-monitoring-linux.html" target="_blank" rel="noopener"&gt;this blog&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;iostat&lt;/code&gt; : collects disk statistics, waits for the given interval, collects them again, and displays the difference&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iotop&lt;/code&gt; : a top-like utility that displays real-time disk activity per process&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dstat&lt;/code&gt; : a versatile resource-statistics tool (roughly vmstat, iostat, and ifstat combined); its &lt;code&gt;--top-io&lt;/code&gt; plugin shows which processes are generating the most I/O&lt;/li&gt;
&lt;/ul&gt;
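&lt;p&gt;A minimal sketch of how I typically invoke these (assuming the sysstat and iotop packages are installed; the interval and count values are just examples):&lt;/p&gt;

```shell
# Per-device extended stats: 2 samples, 1 second apart. The first sample
# is the average since boot; the second shows the delta for the interval.
if command -v iostat >/dev/null; then
  iostat -dx 1 2
fi

# iotop needs root and a terminal, so shown here as comments:
#   sudo iotop -o          # only list processes currently doing I/O
#   sudo iotop -b -n 5 -o  # batch mode, 5 iterations (scriptable)

# dstat rolls vmstat/iostat/ifstat-style columns into one view:
#   dstat --top-io         # flag the process doing the most I/O per tick

# All of these ultimately read the kernel's per-device counters:
sample=$(head -n 1 /proc/diskstats 2>/dev/null)
sample=${sample:-unavailable}
echo "$sample"
```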
&lt;h2 id="network"&gt;Network&lt;/h2&gt;
&lt;h3 id="iperf"&gt;&lt;code&gt;iperf&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;iperf&lt;/code&gt; uses a client-server architecture: install it on both nodes, then run it in server mode on one end and in client mode on the other. See &lt;a href="https://zijishi.xyz/post/linux/iperf-tuning-guide/" target="_blank" rel="noopener"&gt;this post&lt;/a&gt; for more details.&lt;/p&gt;
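&lt;p&gt;A hedged sketch of a typical run (the server host name is a placeholder; flags are for iperf3, whose defaults differ slightly from classic iperf2):&lt;/p&gt;

```shell
PORT=5201   # iperf3's default port (classic iperf2 used 5001)

# On the server node:  iperf3 -s -p "$PORT"
# On the client node:  iperf3 -c server.example.com -p "$PORT" -t 30 -P 4
#   -t 30  run for 30 seconds      -P 4  use 4 parallel streams

# Loopback self-test on a single machine, as a sanity check only:
if command -v iperf3 >/dev/null; then
  iperf3 -s -D -1 -p "$PORT" || true   # daemonize, serve one client, exit
  sleep 1
  iperf3 -c 127.0.0.1 -p "$PORT" -t 2 || true
fi
```

&lt;p&gt;The loopback number exercises memory bandwidth and CPU rather than the network, so treat it purely as a smoke test.&lt;/p&gt;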
&lt;h2 id="memory"&gt;Memory&lt;/h2&gt;
&lt;h3 id="monitor-available-memory"&gt;Monitor available memory&lt;/h3&gt;
&lt;p&gt;Please refer to my previous post on &lt;a href="https://zijishi.xyz/post/linux/monitor-memory-usage-with-free/" target="_blank" rel="noopener"&gt;&lt;code&gt;free&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
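&lt;p&gt;As a quick sketch, the available figure that &lt;code&gt;free&lt;/code&gt; prints comes from the kernel's MemAvailable estimate, which you can also read directly:&lt;/p&gt;

```shell
# MemAvailable estimates how much memory new workloads can claim
# without pushing the system into swap.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
avail_kb=${avail_kb:-0}   # fall back to 0 on non-Linux systems
echo "available: $((avail_kb / 1024)) MiB"

# free itself, human-readable and refreshing every 2 seconds:
#   free -h -s 2
```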
&lt;h2 id="cpu"&gt;CPU&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt; : display and update sorted information about processes&lt;/li&gt;
&lt;li&gt;&lt;code&gt;htop&lt;/code&gt; : interactive process viewer (aka prettier version of &lt;code&gt;top&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
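&lt;p&gt;For scripting, &lt;code&gt;top&lt;/code&gt; has a batch mode, and the load averages are available on their own (a rough sketch):&lt;/p&gt;

```shell
# One non-interactive snapshot; the header lines summarize load,
# task counts, CPU, and memory.
if command -v top >/dev/null; then
  top -b -n 1 | head -n 5
fi

# Just the 1-minute load average (run-queue length, from /proc/loadavg):
load1=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null)
load1=${load1:-0}
echo "1-minute load average: $load1"
```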
&lt;h2 id="nvidia-gpu"&gt;Nvidia GPU&lt;/h2&gt;
&lt;p&gt;Use &lt;code&gt;nvidia-smi&lt;/code&gt; to monitor GPU usage. However, there are some caveats.&lt;/p&gt;
&lt;h3 id="gpu-utilization"&gt;GPU Utilization&lt;/h3&gt;
&lt;p&gt;It is worth noting that GPU utilization as reported by &lt;code&gt;nvidia-smi&lt;/code&gt; does not reflect how busy the GPU&amp;rsquo;s compute units are. It is the percentage of time over the past sample period during which at least one kernel was executing on the GPU, regardless of how many SMs that kernel actually occupied.&lt;/p&gt;
&lt;p&gt;If the GPU is idle, this percentage will be near 0%. But a high reading does not imply useful work: a kernel that spins in a loop doing nothing, or one that occupies a single SM while the rest sit idle, still reports 100% utilization, exactly like a well-optimized kernel that saturates every SM. In other words, the metric tells you whether the GPU is running kernels at all, not how efficiently it is running them.&lt;/p&gt;
&lt;h3 id="gpu-memory-usage"&gt;GPU Memory Usage&lt;/h3&gt;
&lt;p&gt;Percent of time over the past sample period (usually 1 second) during which global (device) memory was being read or written.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You can keep the utilization counters near 100% by simply running a kernel on a single SM and shuttling 1 byte back and forth over PCIe. Utilization is not a &amp;ldquo;how well are you using the resources&amp;rdquo; statistic but an &amp;ldquo;are you using the resources at all&amp;rdquo; one. To get SM-level information, consider using NVIDIA&amp;rsquo;s profilers.&lt;/p&gt;&lt;/blockquote&gt;
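&lt;p&gt;Both counters can be queried in machine-readable form; note the contrast between &lt;code&gt;utilization.memory&lt;/code&gt; (a duty cycle) and &lt;code&gt;memory.used&lt;/code&gt; (actual occupancy). The binary name in the profiler example is a placeholder:&lt;/p&gt;

```shell
if command -v nvidia-smi >/dev/null; then
  # One CSV line per GPU: kernel duty cycle, memory-bus duty cycle,
  # and how much device memory is actually allocated.
  nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv,noheader
fi

# SM-level metrics need a profiler, e.g. Nsight Compute:
#   ncu ./my_app    # my_app is a placeholder for your binary
NOTE="utilization.memory is a duty cycle; memory.used is occupancy"
echo "$NOTE"
```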
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Nvidia-smi: &lt;a href="https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t" target="_blank" rel="noopener"&gt;https://docs.nvidia.com/deploy/nvml-api/structnvmlUtilization__t.html#structnvmlUtilization__t&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</description></item></channel></rss>