Tier II servers act as message transfer agents between distribution modules. Bridgehead servers send e-mail upstream to the Internet Mail Service (IMS) and to other bridgehead servers.
Problems on a bridgehead server affect all servers within its site and any other sites trying to communicate with that site. Messages will still travel within the affected site, but not to any outside sites, including the IMS and the Internet. A bridgehead server failure causes messages to accumulate both in its own queues and in the queues of any sites attempting to communicate with the problem site.
The messages will remain in queue until the connection is re-established or they are removed with a diagnostic tool.
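The queueing behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the `Site` class, its attributes, and the message names are hypothetical and do not correspond to any Exchange API. It shows intra-site mail still being delivered while inter-site mail waits in the sender's queue until the connection returns.

```python
from collections import deque


class Site:
    """Toy model of a site whose outside mail flows through one bridgehead."""

    def __init__(self, name):
        self.name = name
        self.bridgehead_up = True
        self.queue = deque()   # (message, destination) pairs awaiting delivery
        self.delivered = []    # messages received by this site

    def send(self, message, destination):
        if destination is self:
            # Intra-site mail does not depend on the bridgehead server.
            self.delivered.append(message)
        elif self.bridgehead_up and destination.bridgehead_up:
            destination.delivered.append(message)
        else:
            # Either end is down: the message waits in the sender's queue.
            self.queue.append((message, destination))

    def retry(self):
        """Resend everything queued; undeliverable items are queued again."""
        pending, self.queue = self.queue, deque()
        for message, destination in pending:
            self.send(message, destination)


site_a, site_b = Site("A"), Site("B")
site_a.bridgehead_up = False          # simulate a bridgehead failure in site A
site_a.send("local", site_a)          # still delivered inside site A
site_a.send("a-to-b", site_b)         # queues in site A
site_b.send("b-to-a", site_a)         # queues in site B
queued_during_failure = (len(site_a.queue), len(site_b.queue))

site_a.bridgehead_up = True           # connection re-established
site_a.retry()
site_b.retry()
queued_after_recovery = (len(site_a.queue), len(site_b.queue))
print(queued_during_failure, queued_after_recovery)
```

In this model a message queues on the sending side whenever either bridgehead is down, which is why both the failed site and its peers accumulate mail.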
Consider scheduled downtime when monitoring Exchange services and processes. Exchange administrators should know which servers are scheduled for maintenance so that the monitors do not raise false alerts.
First, make sure the efforts of Exchange administrators and those performing maintenance are coordinated. Also, temporarily disable any "auto-fix" type of monitors during scheduled maintenance. Suggestion: disable all monitors during the part of the day in which maintenance is scheduled to occur.
All EventLog ID numbers assume use of Microsoft Exchange version 5.0. EventLog IDs for Exchange 5.5 may differ, but the problem description and resolution will remain the same.
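As a rough illustration of the scheduled-maintenance advice above, the following sketch suppresses alerts for a server while it is inside a daily maintenance window. The `MaintenanceWindow` class, the `should_alert` function, and the server name `BRIDGEHEAD01` are hypothetical names chosen for this example; a real monitoring product will have its own way of scheduling blackout periods.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical daily maintenance window for one server."""
    server: str
    start: time
    end: time

    def covers(self, server: str, moment: datetime) -> bool:
        if server != self.server:
            return False
        now = moment.time()
        if self.start <= self.end:
            return self.start <= now < self.end
        # The window wraps past midnight, e.g. 23:00 to 02:00.
        return now >= self.start or now < self.end


def should_alert(server: str, moment: datetime, windows: list) -> bool:
    """Return False while the server is inside a scheduled window."""
    return not any(w.covers(server, moment) for w in windows)


windows = [MaintenanceWindow("BRIDGEHEAD01", time(23, 0), time(2, 0))]
print(should_alert("BRIDGEHEAD01", datetime(1998, 3, 1, 23, 30), windows))
print(should_alert("BRIDGEHEAD01", datetime(1998, 3, 1, 12, 0), windows))
```

The first call falls inside the window and is suppressed; the second falls outside it and would alert as usual. Keeping the windows in one shared list is one way to give administrators and maintenance staff a single schedule to coordinate on.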
Problem Description | Method of Detection | Recommended Action | Monitoring Interval | Severity | Threshold
Database problems: Database too fragmented | EventLog ID 65 detected | Use "edbutil" to defragment the database (should be done by Exchange administrators only) | 15 min. | 2 | 1 time every 3 months
Database in inconsistent state (This message may also appear for the Directory or Information Store database after a power failure. This error usually means that the database is in an inconsistent state and cannot start.) | EventLog error -550 has occurred | Confirm that the database is in an inconsistent state, and then try a defragmentation repair. Stop all services and back up all files before you manually run the Edbutil.exe program. | 15 min. | 2 | 1
Recovery Database reaching capacity | EventLog ID 1112 detected or IS size reached 80% of logical disk capacity | Normally logged after the database has shut down on reaching capacity. Run edbutil /d on the server to free up space. After edbutil completes, restart the Information Store. | 20 min. | 2 | 1
| Database cache hit rate too low | Monitor the database buffer cache hit ratio for the IS and DS databases | If the hit ratios frequently fall below 95%, the buffers are too small. The DS and IS buffers can be increased if there is sufficient RAM. To correct the problem, manually run perfwiz -v. | 30 min. | 3 | Baseline
| MTA messages per second too low or too high | Monitor the number of messages being processed by the MTA. | Check the status of the MTA and the CPU and memory consumption of the processes. | 15 min. | 1 | Baseline
| MTA process is down | Monitor the number of threads in use by the MTA | Restart MTA Service. If service fails to restart, restart ALL Exchange services in order. | 10 min. | 1 | 1
| MTA Work Queue length too high | Monitor MTA Queue length on server | Check that the MTA service is up, both on this server and on upstream connections (e.g. if the MTA queue length of a bridgehead server is too high, check the MTA on the IMS) | 15 min. | 23 | Baseline
| Directory updates failed | EventLog ID 1171 detected - exception event | A Directory Service problem followed by a 1214 error in the event log indicates that a server is failing on the deletion or addition of a directory object. Contact Microsoft PSS for troubleshooting | 15 min. | 2 | 1
| Directory updates failed | EventLog ID 1214 detected - KCC event | The knowledge consistency checker (KCC) failed to complete successfully. This indicates corruption in the directory schema that may affect more than one server in a site or Organizational Unit. Contact Microsoft PSS for troubleshooting | 15 min. | 2 | 1
| Directory Services Pending Replications too high | Monitor the number of pending replications in the DS | A large lag in directory updates may indicate a problem with network connectivity to other bridgehead servers. Confirm that this server can still ping the other bridgehead servers it uses for directory replication. This can also occur with servers in the same site. | 30 min. | 2 | Baseline
| Directory Services remaining replication updates not decreasing | Monitor the number of objects being processed by the DS | Indicates that the server is failing to exchange directory replication messages, because of either network issues or directory problems. Check the event logs on the server for details | 30 min. | 2 | Baseline
| Overall Exchange problems: Exchange services down | Monitor the service control manager to detect status of services. | Check all Exchange services. Restart services that are down in order. Gracefully reboot server if necessary. | 5 min. | 1 | 1
| Exchange process dead | Monitor the CPU and thread utilization of the Exchange processes | Check all of the Exchange services. Restart services that are down in order. Gracefully reboot server if necessary | 5 min. | 1 | Baseline
| Runaway Exchange process | Monitor the CPU and memory utilization of the Exchange processes | Check all of the Exchange services. Restart services that are down in order. Gracefully reboot server if necessary. | 5 min. | 1 | Baseline
| Paging too high | Monitor the paging frequency of the operating system (pagefile usage) | Excessive paging indicates that a memory upgrade is needed. If paging persists, treat as a bug. | 10 min. | 2 | Baseline
| Low logical disk free space | Monitor the logical disk space of the Exchange machines | Delete unnecessary files to free up disk space. Add disk capacity if necessary. | 15 min. | 1 | Baseline
| CPU Queue Length too high | Monitor the overall queue length of the CPU over a prolonged period. | Use Performance Monitor to identify CPU bottlenecks and rectify as necessary | 30 min. | 2 | Baseline
| Compaq Insight Manager errors | Monitor the internal temperature of server | Check hardware for errors | 10 min. | 1 | Baseline
| Compaq Insight Manager errors | Monitor any critical IDE or SCSI disk failures | Check hardware for errors | 10 min. | 1 | Baseline
| Compaq Insight Manager errors | Monitor NIC failures | Check hardware for errors | 10 min. | 1 | Baseline
| Compaq Insight Manager errors | Monitor any fan failures | Check hardware for errors | 10 min. | 1 | Baseline
| Compaq Insight Manager errors | Monitor any correctable memory errors | Check hardware for errors | 10 min. | 1 | Baseline
| Network utilization high | Monitor total bytes per second processed by the network interface card (NIC). | Check and/or tune performance of the NIC. | 10 min. | 3 | Baseline
| ICMP errors | Monitor the receipt time for ICMP packets | Check and/or tune performance of the NIC. | 10 min. | 3 | Baseline
| ICMP errors | Monitor the level of unreachable destinations | Check and/or tune performance of the NIC. | 10 min. | 3 | Baseline
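Several thresholds in the table are numeric (IS size at 80% of logical disk capacity, cache hit ratio below 95%), while others are listed only as "Baseline". The sketch below shows one way such checks might be expressed. The function names are hypothetical, and two parameters are assumptions that the table does not define: the fraction of low samples that counts as "frequently", and the tolerance around a baseline average.

```python
from statistics import mean


def database_near_capacity(is_size_bytes: int, disk_capacity_bytes: int) -> bool:
    """True once the IS size has reached 80% of logical disk capacity."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return is_size_bytes * 100 >= 80 * disk_capacity_bytes


def cache_hit_rate_too_low(samples: list, max_low_fraction: float = 0.25) -> bool:
    """True if hit-ratio samples fall below 95% 'frequently'.

    The 25% cutoff for 'frequently' is an assumption made for this sketch.
    """
    if not samples:
        return False
    low = sum(1 for s in samples if s < 95.0)
    return low / len(samples) > max_low_fraction


def deviates_from_baseline(current: float, baseline: list, tolerance: float = 0.5) -> bool:
    """True if `current` is more than `tolerance` (a fraction) from the baseline mean.

    The 50% default tolerance is an assumption; each site sets its own baseline.
    """
    reference = mean(baseline)
    return abs(current - reference) > tolerance * abs(reference)


print(database_near_capacity(8_000_000_000, 10_000_000_000))
print(cache_hit_rate_too_low([97.0, 94.0, 93.0, 96.0]))
print(deviates_from_baseline(160, [100, 120, 80]))
```

The baseline check works in both directions, which suits rows such as "MTA messages per second too low or too high", where a value far from normal on either side is worth investigating.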