Mandi Walls from AOL is talking about creating actionable logs. Actionable logs are logs that provide data that can be used to fix problems. There are a few rules to start with:

  • No nonsense logging
  • Concise, easy to understand
  • Express symptoms of production issues
  • Anything that makes it into the log needs to be something that can be fixed (better signal-to-noise ratio)

Every time you write to a log file, you're expending resources. The point of logging in production is diagnosing issues. You need to be able to understand the logs at 4am.

The primary goal is diagnosis and recovery of problems. Secondary goals include statistics and monitoring, insight into application behavior, and indicating potential problems. Note that these are different from the goals of development and QA logs.

Logs come in different flavors: access logs, server logs (e.g. Catalina), application logs, and special use logs for groups of activities.

Some hints:

  • Log locations should be predictable and obvious. You may want logs on different disk partitions (this stops full file systems from crashing the server). Keep old log files in an obvious place as well.
  • Roll logs into files with timestamps in the names.
  • Logs should be human readable and easy to parse. Use real dates and times. Unix timestamps don't pass the 4am test. Good timestamps give you the ability to link server activities to external events (like network outages).
  • Create a common format for multiple applications where possible.
  • Use one line per log message wherever possible.
  • Avoid messages that contain only a numerical error code.
  • Put URLs to external info in log messages where appropriate.
  • Be consistent about severity. Saying something's "fatal" without more data isn't helpful.
  • Log at the first point the error is encountered. If a server is processing 100,000 requests per minute, waiting a minute to log something means there's lots of data in between the problem and the log entry.
  • Actively manage and prune logs to make new errors obvious.
  • Don't include usernames, logins, passwords, etc. These are development logging issues, not production.
  • An application log should have 10-25% the number of entries of the access log. Too much data hides problems.
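
Several of the hints above (timestamped roll files, real dates and times, one line per message, error codes accompanied by text and a URL) can be sketched with Python's standard logging module. This is only an illustration, not something shown in the talk: the logger name, the ORD-1042 error code, and the wiki URL are all hypothetical, and the 14-file retention is an arbitrary assumption.

```python
import logging
import logging.handlers
import os
import tempfile


class OneLineFormatter(logging.Formatter):
    """Collapse embedded newlines so every entry stays on a single line."""

    def format(self, record):
        text = super().format(record)
        return text.replace("\r", " ").replace("\n", " | ")


def build_logger(log_dir, name="orders"):
    """Return a logger writing to <log_dir>/<name>.log, rolled at midnight.

    With when="midnight", rolled files get a date suffix such as
    orders.log.2019-10-10, so old logs sit next to the live one.
    """
    handler = logging.handlers.TimedRotatingFileHandler(
        os.path.join(log_dir, name + ".log"),
        when="midnight",
        backupCount=14,  # assumption: keep two weeks of old files
    )
    # Real dates and times instead of Unix timestamps.
    handler.setFormatter(OneLineFormatter(
        fmt="%(asctime)s %(levelname)s %(name)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log = build_logger(log_dir)
    # Hypothetical error code and runbook URL: the code alone would not
    # pass the 4am test, so the message says what happened and where to look.
    log.error(
        "ORD-1042 payment gateway timeout after 30s; see %s",
        "https://wiki.example.com/runbooks/ORD-1042",
    )
    with open(os.path.join(log_dir, "orders.log")) as f:
        print(f.read(), end="")
```

Sharing one formatter like this across applications is also a cheap way to get the common format mentioned above.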

In summary, make production logs about helping operations staff solve problems. Good logs speed up diagnosis; poor logs get in the way of it.



Last modified: Thu Oct 10 12:47:18 2019.