The script I'm presenting here isn't one of the actual ones that I run in production, but it's close. The real one sends emails, does DNS lookups, keeps a "rare messages" database using sqlite3, and a few other things, but I wanted to keep this simple.
One of the problems I see with getting started with log analysis is that people tend to approach it like a typical vendor RFP project: list some requirements, survey the market, evaluate and buy a product to fit your requirements. Sounds good, right? The problem with log analysis is that often you don't know what your requirements really are until you start looking at data.
A simple message counting script like this lets you look at your data, and provides a simple platform on which you can start to iterate to find your specific needs. It also lets us look at some cool Python features.
I don't recommend pushing this too far: once you have a decent idea of what your data looks like and what you want to do with it, set up Logstash, Graylog2, or a similar commercial product like Splunk (if you can afford it).
That said, here's the Python:
from collections import Counter, defaultdict
import re
import sys

"""
Counterpart to this blog post:
http://unroutable.blogspot.com/2014/07/simple-python-syslog-counter.html

Summarize counts of typical Cisco syslog messages. Most syslog servers produce lines that look something like this:

TIMESTAMP FACILITY.SEVERITY REPORTER_IP %SOME-X-MESSAGE_TYPE: message details

However, the format may vary depending on server.
This script extracts the first IPv4 address from the line and assumes that's the reporting device.
It then extracts the MESSAGE_TYPE field, if present.

It summarizes the count of each unique message type in the input, then summarizes the counts
of each message type, per reporting IPv4 address:

10 SOME-5-MESSAGE
4 SOME-4-OTHERMESSAGE
10.1.1.1
    4 SOME-5-MESSAGE
    2 SOME-4-OTHERMESSAGE
10.1.1.2
    6 SOME-5-MESSAGE
    2 SOME-4-OTHERMESSAGE
"""

# input file is first argument on the CLI
# you could easily make this a list or implement CLI switches with argparse
FILENAME = sys.argv[1]

# monster named regex to match IPv4 addresses, courtesy of Logstash grok pattern code
IPV4 = re.compile('(?P<ipv4>(?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9]))')

# named regex to match Cisco syslog message types
CISCO_MSG = re.compile('%(?P<msg>.*?):')

# counter object for counting all message types in file
msg_count = Counter()

# dictionary of counts per syslog-sending IP address
per_reporter_counts = defaultdict(Counter)


def print_counter(counter, indent=0):
    """ print contents of a Counter object with or without indent. """
    for item, count in counter.most_common():
        print('%s %-5i %s' % (' ' * indent, count, item))


def main():
    with open(FILENAME) as f:
        for line in f:
            try:
                # match first IPv4 address and first Cisco-like message-type block
                reporter = re.search(IPV4, line).group('ipv4')
                msg = re.search(CISCO_MSG, line).group('msg')
            except AttributeError:
                # re.search returned None, so this line is not a typical Cisco syslog message
                continue
            # increment message counter
            msg_count[msg] += 1
            # increment message counter per reporting IP address
            per_reporter_counts[reporter][msg] += 1

    # code below here just prints the results
    print_counter(msg_count)
    for reporter, counter in per_reporter_counts.items():
        print(reporter)
        print_counter(counter, indent=4)


if __name__ == '__main__':
    main()
$ python simple_syslog_count.py sample.txt
 214   SEC-6-IPACCESSLOGP
 15    SEC-6-IPACCESSLOGRL
 10    LINEPROTO-5-UPDOWN
 10    LINK-3-UPDOWN
 7     USER-3-SYSTEM_MSG
 4     STACKMGR-4-STACK_LINK_CHANGE
 4     DUAL-5-NBRCHANGE
 3     IPPHONE-6-UNREGISTER_NORMAL
 3     CRYPTO-4-PKT_REPLAY_ERR
 3     SEC-6-IPACCESSLOGRP
 3     SEC-6-IPACCESSLOGSP
 2     SSH-5-SSH2_USERAUTH
 2     SSH-5-SSH2_SESSION
 2     SSH-5-SSH2_CLOSE
10.1.16.12
     6     SEC-6-IPACCESSLOGP
10.1.24.3
     2     LINEPROTO-5-UPDOWN
     2     LINK-3-UPDOWN
[Stuff deleted for brevity]
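To experiment with the counting logic without a log file on disk, the same loop can be wrapped in a function that accepts any iterable of lines. This is a minimal sketch rather than the script itself: `count_messages` and the sample lines are invented for illustration, and the IPv4 pattern is a deliberately loose stand-in for the grok-derived regex (it does not validate octet ranges).

```python
import re
from collections import Counter, defaultdict

# Simplified IPv4 pattern (assumption: loose match, octet ranges are not checked)
IPV4 = re.compile(r'(?P<ipv4>(?<![0-9])(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?![0-9]))')
CISCO_MSG = re.compile(r'%(?P<msg>.*?):')


def count_messages(lines):
    """Return (overall Counter, per-reporter defaultdict of Counters)."""
    msg_count = Counter()
    per_reporter_counts = defaultdict(Counter)
    for line in lines:
        ip_match = IPV4.search(line)
        msg_match = CISCO_MSG.search(line)
        if ip_match is None or msg_match is None:
            continue  # not a typical Cisco syslog line
        reporter = ip_match.group('ipv4')
        msg = msg_match.group('msg')
        msg_count[msg] += 1
        per_reporter_counts[reporter][msg] += 1
    return msg_count, per_reporter_counts


# Hypothetical sample lines, invented for this example
SAMPLE = [
    'Jul  1 12:00:00 local7.notice 10.1.1.1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to up',
    'Jul  1 12:00:01 local7.notice 10.1.1.1 %LINEPROTO-5-UPDOWN: Line protocol on Gi0/1, changed state to up',
    'Jul  1 12:00:02 local7.notice 10.1.1.2 %LINK-3-UPDOWN: Interface Gi0/2, changed state to down',
    'this line has no address or message type',
]

totals, by_reporter = count_messages(SAMPLE)
```

Checking the match objects against `None` does the same job as the try/except in the script; which one reads better is mostly a matter of taste.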
For Pythonistas, the script makes use of a few cool language features:
Named, Compiled Regexes
- We can name a regex match with the (?P<name>PATTERN) syntax, which makes it easy to understand when it's referenced later with the .group('name') method on the match object.
- This is demonstrated in the IPV4 and CISCO_MSG definitions and in the re.search calls in main() (lines 36-39 and 58-59 of the gist shown above).
- It would be more efficient to capture these fields by splitting the line with the .split() string method, but I wanted the script to work for unknown field positions -- hence the regex.
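As a small illustration of named groups, the sketch below pulls two fields out of one line with a single pattern and reads them back by name. The sample line and the combined `LINE` pattern are invented for this example (the script itself uses two separate patterns); `groupdict()` is the standard match-object method that returns every named group at once.

```python
import re

# Hypothetical sample line, invented for this example
line = 'Jul  1 12:00:00 local7.notice 10.1.1.1 %SEC-6-IPACCESSLOGP: list 101 denied tcp'

# One pattern with two named groups; the IPv4 part is deliberately loose
LINE = re.compile(r'(?P<ipv4>(?:[0-9]{1,3}\.){3}[0-9]{1,3}).*?%(?P<msg>.*?):')

match = LINE.search(line)
if match:
    reporter = match.group('ipv4')   # look up a single group by name
    fields = match.groupdict()       # or get every named group as a dict
    print(reporter, fields['msg'])

# The positional alternative only works when the field order is fixed
parts = line.split()
positional_reporter = parts[4]
```

The split-based lookup is cheaper, but it silently returns the wrong field as soon as a syslog server adds or drops a column, which is the trade-off described above.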
Multiplication of Strings
- We control indentation by multiplying the ' ' string (that's a single space enclosed in quotes) by an integer value in the print_counter function (line 50).
- The reason this works is that the Python str class defines a special __mul__ method that controls how the * operator works for objects of that class:
>>> 'foo'.__mul__(3)
'foofoofoo'
>>> 'foo' * 3
'foofoofoo'
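The same hook is available to your own classes. In this sketch, `Indent` is a hypothetical toy class written only to show that `*` calls `__mul__`; nothing like it appears in the syslog script.

```python
class Indent(object):
    """Toy class: multiplying an Indent by n yields n copies of its unit string."""

    def __init__(self, unit=' '):
        self.unit = unit

    def __mul__(self, n):
        # Called for `Indent(...) * n`; delegate to str's own multiplication
        return self.unit * n


four_spaces = Indent() * 4
dashes = Indent('-') * 3
print(repr(four_spaces), repr(dashes))
```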
collections.Counter Objects
- Counter objects are a subclass of dictionaries that know how to count things. Jeremy Schulman talked about these in a comment on the previous post. Here, we use Counters to build both the overall message counts and the per-device message counts:
>>> import re
>>> my_msg = 'timestamp ip_address stuff %MY-4-MESSAGE:other stuff'
>>> CISCO_MSG = re.compile('%(?P<msg>.*?):')
>>> from collections import Counter
>>> test_counter = Counter()
>>> this_msg = re.search(CISCO_MSG,my_msg).group('msg')
>>> this_msg
'MY-4-MESSAGE'
>>> test_counter[this_msg] += 1
>>> test_counter
Counter({'MY-4-MESSAGE': 1})
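A Counter can also be fed an iterable directly, and `most_common()` returns the entries sorted by count, which is what `print_counter` relies on. The message list below is invented for illustration.

```python
from collections import Counter

# Hypothetical message types, as they might be extracted from a few log lines
messages = ['LINK-3-UPDOWN', 'SEC-6-IPACCESSLOGP', 'LINK-3-UPDOWN',
            'SEC-6-IPACCESSLOGP', 'SEC-6-IPACCESSLOGP']

counts = Counter(messages)               # count every item in one call
counts.update(['SEC-6-IPACCESSLOGP'])    # add more observations later

top = counts.most_common(1)              # list of (item, count), highest first
missing = counts['NOT-1-SEEN']           # missing keys read as 0, no KeyError
print(top, missing)
```

Reading a missing key returns 0 without adding it to the Counter, which is why the script can use `+= 1` on message types it has never seen before.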
collections.defaultdict Dictionaries
- It can get annoying when you're assigning dictionary values inside a loop, because you get errors when the key doesn't exist yet. This is a contrived example, but it illustrates the point:
>>> reporters = {}
>>> for reporter in ['1.1.1.1','2.2.2.2']:
...     reporters[reporter].append('foo')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
KeyError: '1.1.1.1'
- To fix this, you can catch the exception:
>>> reporters = {}
>>> for reporter in ['1.1.1.1','2.2.2.2']:
...     try:
...         reporters[reporter].append('foo')
...         reporters[reporter].append('bar')
...     except KeyError:
...         reporters[reporter] = ['foo']
...         reporters[reporter].append('bar')
...
- As usual, though, Python has a more elegant way in the collections module: defaultdict
>>> from collections import defaultdict
>>> reporters = defaultdict(list)
>>> for reporter in ['1.1.1.1','2.2.2.2']:
...     reporters[reporter].append('foo')
...     reporters[reporter].append('bar')
...
>>> reporters
defaultdict(<class 'list'>, {'1.1.1.1': ['foo', 'bar'], '2.2.2.2': ['foo', 'bar']})
In the syslog counter script, we use a collections.Counter object as the type for our defaultdict. This allows us to build a per-syslog-reporter dictionary that shows how many times each message appears for each reporter, while only looping through the input once (line 66):
per_reporter_counts[reporter][msg] += 1
Here, the dictionary per_reporter_counts has the IPv4 addresses of the syslog reporters as keys, with a Counter object as the value holding the counts for each message type:
>>> from collections import Counter,defaultdict
>>> per_reporter_counts = defaultdict(Counter)
>>> per_reporter_counts['1.1.1.1']['SOME-5-MESSAGE'] += 1
>>> per_reporter_counts
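defaultdict(<class 'collections.Counter'>, {'1.1.1.1': Counter({'SOME-5-MESSAGE': 1})})
>>> per_reporter_counts['1.1.1.1']['SOME-5-MESSAGE'] += 5
>>> per_reporter_counts
defaultdict(<class 'collections.Counter'>, {'1.1.1.1': Counter({'SOME-5-MESSAGE': 6})})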
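Once the nested structure is built, it can be summarized further. The sketch below ranks reporters by their total message count; the sample data and the `busiest_reporters` helper are hypothetical and not part of the original script.

```python
from collections import Counter, defaultdict


def busiest_reporters(per_reporter_counts):
    """Return (reporter, total) pairs, highest total first."""
    totals = Counter()
    for reporter, counter in per_reporter_counts.items():
        totals[reporter] = sum(counter.values())
    return totals.most_common()


# Hypothetical data, shaped like the script's per_reporter_counts
per_reporter_counts = defaultdict(Counter)
per_reporter_counts['10.1.1.1']['SOME-5-MESSAGE'] += 4
per_reporter_counts['10.1.1.1']['SOME-4-OTHERMESSAGE'] += 2
per_reporter_counts['10.1.1.2']['SOME-5-MESSAGE'] += 9

ranking = busiest_reporters(per_reporter_counts)
for reporter, total in ranking:
    print('%-10s %i' % (reporter, total))
```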
If you got this far, you can go implement it for IPv6 addresses. :-)
5 comments:
It is very cool to be able to do that in Python. I certainly need to pick up some skills there and this serves as a good example. I currently do something very similar to this by chaining together awk, count and sort. Thanks for sharing.
I know a lot of people swear by Python, but is it really better than its competitors?
JH
Nice syslog counter, thanks! Python is great for math stuff in my experience.