In a well managed IT infrastructure, network monitoring acts as the eyes and ears of an organisation to spot problems before they appear. Systems administrators need complete visibility into their critical IT components such as servers, applications and networks. These tools can monitor a server crash, a failing application, or in some cases, highly utilised network bandwidth.
Features of network monitoring tools
A network monitoring tool is usually hosted on a standalone server and runs its client software on each machine to be managed or monitored. The tool usually runs its own copy of the database such as MySQL or Postgres, which stores all scripts, historic events and actions. In some modern tools, an agent is not required to be run on managed machines, making it an agent-less installation. Table 1 lists a stack of features with examples, which must be available by default in a network monitoring tool.
When it comes to monitoring large scale IT infrastructures, systems administrators need architecture with much-advanced features to make their life easy. Given below is a list of some important features.
Auto discovery: It is cumbersome for administrators to add each managed device manually. Modern monitoring tools span the entire network segment to enumerate devices and perform auto discovery of operating system, configuration and settings. This feature automatically helps admins to get a glimpse of their IT inventory.
Network traffic stats: Earlier, monitoring tools used to just look at the CPU, memory and disk utilisation. However, this is not enough and network bandwidth usage is a key factor to be aware of, especially when the managed machines are supposed to access the Internet. Besides, by monitoring network traffic, admins get an insight into the bandwidth usage of the Internet service providers line, which helps them make a capacity planning decision.
Log monitoring: All operating systems create activity logs. For example, in case of Linux, SSH logs and bash logs are created, while for Windows, the application, system and security event logs are generated. A good tool must be capable of reading and parsing log files. This sounds easy but can be tricky, because the operating system opens log files and locks those that require tools to sneak into the file without tampering or corrupting it. Monitoring tools should be able to check log file size, parse text for particular string patterns, etc, and perform configured actions. This gives a lot of power to admins to tune their infrastructure monitoring for better control.
Device grouping: This is important for easy management of devices such as firewalls, servers, etc, in specific groups. In some cases, administrators choose to create department wise groups, or a group for each building or floor. They populate these groups with network switches, servers and desktops pertaining to that department or floor. In a growing infrastructure, this feature is very important.
Alert management: Merely monitoring a network is not enough. A good tool should let admins produce alerts. For example, if the CPU of a critical server crosses 90 per cent of usage, or if a firewall is dropping multiple packets in a row, it should create a trouble ticket, and email, or optionally send a short message to the admin’s mobile phone. Almost all tools provide such facilities today to enhance their usefulness; however, admins should look into the configurability and facilities available in alert management, prior to selecting the proper tool.
Customisable Web dashboard: A good monitoring tool should let admins access its statistics over a Web interface. Besides, the Web interface must be customisable to let them decide what should be on the dashboard’s front page. Modern tools provide widgets that are small screen sections or windows, which can show monitoring statistics of the admin’s choice and can be moved or removed.
Integrating with helpdesk: Recording of events that are the result of threshold violation is very important and should be an automated process. The monitoring tool should provide the necessary hooks or connectors so that the trouble ticket/helpdesk system can be easily connected. Monitoring of events and the applicable actions should result in a trouble ticket. This helps decide how much manpower ought to be utilised to address those events and intelligent action can be taken based on that data.
Report generation: All monitoring tools today provide some level of report generation, which is based on the date, time, etc. However, a detailed report that is device-specific or event-specific is really essential for an admin. For example, a report generator should be able to drill down into a particular event such as a TCP timeout on a particular server, and provide historic occurrences of that event for that server. These levels of granular details help administrators establish a co-relation between the event and its root cause.
The following is a list of new features found in commercial monitoring tools; however, the open source world will surely catch up in the days to come.
Plug-in API support: While a few open source tools do provide this, there is still a scope for improvement. API calls of the monitoring engine can be exposed in a secure way, so that developers can write their own plug-ins. This is especially important when there is a new network device or software application in the market that must be monitored.
Trend analysis: The network or server monitoring industry is rapidly moving away from the preventive to the pro-active mode. Administrators want to know historic trends of problems and make a judgment in terms of the corrective actions to be taken today, to prevent problems that might happen tomorrow. For example, continuously high CPU utilisation on a MySQL server over a period suggests that one or more stored procedures are either not optimised or misbehaving. This can be related to the application that uses those procedures. Thus, if that application is expected to be used more, an analysis of the trends can tell admins that the MySQL server is going to run into trouble.
Security monitoring: Very soon, no monitoring tool will be useful unless it supports cyber security monitoring. Attacks happening at Layer 2 and 3, as well as application-based security problems at Layer 7, should be trapped and reported by a good monitoring tool. This functionality is available in a few commercial tools; however, incorporating Snort along with Nagios or any other monitoring tool can prove to be a powerful security monitoring solution.
Nagios, Zenoss and Zabbix
So let’s talk about the three famous open source monitoring tools-Nagios, Zenoss and Zabbix, and compare them. While there are many features to compare, we will discuss only those that matter the most to mid-scale IT infrastructure management.
Nagios: This is a famous first generation network monitoring tool and is used in all Linux distros. Developed in C and PHP, it supports multiple flavours of open source backend databases, as well as the legacy flat file structure.
Zenoss: Written using Python scripting, Zenoss provides a highly flexible monitoring platform for mid scale and large scale infrastructures. It supersedes Nagios in a few cases, especially when it comes to alert management.
Zabbix: This is really an enterprise class open source tool. Written in C and PHP, it has very elaborate dashboards that provide admins a detailed drill down.
While it is tough to decide which tool is best for monitoring, here are a few guidelines. Administrators should first look at their infrastructure from the uptime perspective and decide what needs to be really monitored, rather than checking all that they can possibly monitor. This focused approach is important because it is easy to get distracted with multiple features available in each tool. Hence, focusing on the basic monitoring requirements mentioned earlier should be first on the agenda. As a second step, admins should look into the applications to be monitored and decide whether or not custom scripting needs to be done to achieve what they need from the monitoring standpoint. The third step should be to focus on reporting and trend analysis, because as infrastructure grows, it is essential to have a historic record of the problems in IT infrastructure.
The last but important step would be to see if security monitoring is a requirement in the given scenario. If yes, then it is crucial to decide the level of additional scripting and log generation that would be required. The generated log can then be captured by a monitoring tool and report as a problem incidence via the trouble ticketing system.
Nagio, Zenoss and Zabbix are all industry grade, professional tools with large installation bases. It has been observed that Nagios and Zenoss perform very well on the Ubuntu platform, while Zabbix runs great on other distros. Zenoss is unique among the three tools compared, because it offers more features, interacts well with multiple databases and other tools, and also has proved itself to be a robust solution even for high performing, large scale IT infrastructures. Besides these three, there are tools such as Cacti, OpenNMS, Cricket, etc, which I leave readers to find more about on the Net. It is always better to compare an open source tool with a commercial one, and decide and choose the required features.