The script I'm presenting here isn't one of the actual ones that I run in production, but it's close. The real one sends emails, does DNS lookups, keeps a "rare messages" database using sqlite3, and a few other things, but I wanted to keep this simple.
One of the problems I see with getting started with log analysis is that people tend to approach it like a typical vendor RFP project: list some requirements, survey the market, evaluate and buy a product to fit your requirements. Sounds good, right? The problem with log analysis is that often you don't know what your requirements really are until you start looking at data.
A simple message counting script like this lets you look at your data, and provides a simple platform on which you can start to iterate to find your specific needs. It also lets us look at some cool Python features.
I don't recommend pushing this too far: once you have a decent idea of what your data looks like and what you want to do with it, set up Logstash, Graylog2, or a similar commercial product like Splunk (if you can afford it).
That said, here's the Python:
I tried to make this as self-documenting as possible. You run it from the CLI with a syslog file as the argument, and you get this:
$ python simple_syslog_count.py sample.txt
214 SEC-6-IPACCESSLOGP
15 SEC-6-IPACCESSLOGRL
10 LINEPROTO-5-UPDOWN
10 LINK-3-UPDOWN
7 USER-3-SYSTEM_MSG
4 STACKMGR-4-STACK_LINK_CHANGE
4 DUAL-5-NBRCHANGE
3 IPPHONE-6-UNREGISTER_NORMAL
3 CRYPTO-4-PKT_REPLAY_ERR
3 SEC-6-IPACCESSLOGRP
3 SEC-6-IPACCESSLOGSP
2 SSH-5-SSH2_USERAUTH
2 SSH-5-SSH2_SESSION
2 SSH-5-SSH2_CLOSE
10.1.16.12
6 SEC-6-IPACCESSLOGP
10.1.24.3
2 LINEPROTO-5-UPDOWN
2 LINK-3-UPDOWN
[Stuff deleted for brevity]
For Pythonistas, the script makes use of a few cool language features:
Named, Compiled Regexes
- We can name a regex group with the (?P<name>PATTERN) syntax, which makes it easy to understand when it's referenced later with the .group('name') method on the match object.
- This is demonstrated in lines 36-39 and 58-59 of the gist shown above.
- It would be more efficient to capture these fields by splitting the line with the .split() string method, but I wanted the script to work for unknown field positions -- hence the regex.
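As a minimal sketch of the named-group idea (not the code from the script itself), the following compiles two patterns and pulls fields out by name. The REPORTER pattern, the variable names, and the sample log line are assumptions made up for illustration; only the CISCO_MSG pattern mirrors the one used later in this post.

```python
import re

# Hypothetical patterns: one named group for the reporter's IPv4 address,
# one for the Cisco message mnemonic between '%' and the next ':'.
REPORTER = re.compile(r'(?P<reporter>\d{1,3}(?:\.\d{1,3}){3})')
CISCO_MSG = re.compile(r'%(?P<msg>.*?):')

# Made-up syslog line, used only for illustration.
line = 'Mar  1 00:00:01 10.1.16.12 123: %SEC-6-IPACCESSLOGP: list 101 denied tcp'

reporter_match = REPORTER.search(line)
msg_match = CISCO_MSG.search(line)

# search() returns None when nothing matches, so check before calling .group().
if reporter_match and msg_match:
    reporter = reporter_match.group('reporter')
    msg = msg_match.group('msg')
    print(reporter, msg)
```

Because the groups are named, the later .group('reporter') and .group('msg') calls say what they return without anyone having to count parentheses.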
Multiplication of Strings
- We control indentation by multiplying the ' ' string (that's a single space enclosed in quotes) by an integer value in the print_counter function (line 50).
- The reason this works is that the Python str class defines a special __mul__ method that controls how the * operator works for objects of that class:
 >>> 'foo'.__mul__(3)
 'foofoofoo'
 >>> 'foo' * 3
 'foofoofoo'
collections.Counter Objects
- Counter objects are a subclass of dictionaries that know how to count things. Jeremy Schulman talked about these in a comment on the previous post. Here, we use Counters to build both the overall message counts and the per-device message counts:
>>> import re
>>> my_msg = 'timestamp ip_address stuff %MY-4-MESSAGE:other stuff'
>>> CISCO_MSG = re.compile('%(?P<msg>.*?):')
>>> from collections import Counter
>>> test_counter = Counter()
>>> this_msg = re.search(CISCO_MSG,my_msg).group('msg')
>>> this_msg
'MY-4-MESSAGE'
>>> test_counter[this_msg] += 1
>>> test_counter
Counter({'MY-4-MESSAGE': 1})
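To show how a Counter can produce a sorted report like the one at the top of this post, here is a short sketch using most_common(). The message list is invented, and this is only one plausible way to print the counts, not necessarily how the script does it.

```python
from collections import Counter

# Made-up message names, for illustration only.
messages = [
    'LINK-3-UPDOWN',
    'SEC-6-IPACCESSLOGP',
    'LINK-3-UPDOWN',
    'SEC-6-IPACCESSLOGP',
    'SEC-6-IPACCESSLOGP',
]

msg_counts = Counter()
for msg in messages:
    msg_counts[msg] += 1  # a missing key counts as 0, so no KeyError here

# most_common() returns (item, count) pairs, highest count first.
top = msg_counts.most_common()
for msg, count in top:
    print(count, msg)
```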
collections.defaultdict Dictionaries
- It can get annoying when you're assigning dictionary values inside a loop, because you get errors when the key doesn't exist yet. This is a contrived example, but it illustrates the point:
 >>> reporters = {}
 >>> for reporter in ['1.1.1.1','2.2.2.2']:
 ...     reporters[reporter].append('foo')
 ...
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
 KeyError: '1.1.1.1'
 
- To fix this, you can catch the exception:
 >>> reporters = {}
 >>> for reporter in ['1.1.1.1','2.2.2.2']:
 ...     try:
 ...         reporters[reporter].append('foo')
 ...         reporters[reporter].append('bar')
 ...     except KeyError:
 ...         reporters[reporter] = ['foo']
 ...         reporters[reporter].append('bar')
 ...
- As usual, though, Python has a more elegant way in the collections module: defaultdict
>>> from collections import defaultdict
>>> reporters = defaultdict(list)
>>> for reporter in ['1.1.1.1','2.2.2.2']:
...     reporters[reporter].append('foo')
...     reporters[reporter].append('bar')
...
>>> reporters
defaultdict(<class 'list'>, {'1.1.1.1': ['foo', 'bar'], '2.2.2.2': ['foo', 'bar']})
In the syslog counter script, we use a collections.Counter object as the type for our defaultdict. This allows us to build a per-syslog-reporter dictionary that shows how many times each message appears for each reporter, while only looping through the input once (line 66):
per_reporter_counts[reporter][msg] += 1
Here, the dictionary per_reporter_counts has the IPv4 addresses of the syslog reporters as keys, with a Counter object as the value holding the counts for each message type:
>>> from collections import Counter,defaultdict
>>> per_reporter_counts = defaultdict(Counter)
>>> per_reporter_counts['1.1.1.1']['SOME-5-MESSAGE'] += 1
>>> per_reporter_counts
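defaultdict(<class 'collections.Counter'>, {'1.1.1.1': Counter({'SOME-5-MESSAGE': 1})})
>>> per_reporter_counts['1.1.1.1']['SOME-5-MESSAGE'] += 5
>>> per_reporter_counts
defaultdict(<class 'collections.Counter'>, {'1.1.1.1': Counter({'SOME-5-MESSAGE': 6})})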
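Putting the pieces together, here is a rough sketch of a single-pass counter built on defaultdict(Counter). It is not the script from this post: the count_messages function, the REPORTER pattern, and the sample lines are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical patterns; a real log format may need different ones.
REPORTER = re.compile(r'(?P<reporter>\d{1,3}(?:\.\d{1,3}){3})')
CISCO_MSG = re.compile(r'%(?P<msg>.*?):')

def count_messages(lines):
    """Return (overall counts, per-reporter counts) from one pass over lines."""
    overall = Counter()
    per_reporter_counts = defaultdict(Counter)
    for line in lines:
        reporter_match = REPORTER.search(line)
        msg_match = CISCO_MSG.search(line)
        if not (reporter_match and msg_match):
            continue  # skip lines that lack either field
        reporter = reporter_match.group('reporter')
        msg = msg_match.group('msg')
        overall[msg] += 1
        per_reporter_counts[reporter][msg] += 1
    return overall, per_reporter_counts

# Made-up input lines, for illustration only.
sample = [
    'Mar  1 00:00:01 10.1.16.12 %SEC-6-IPACCESSLOGP: list 101 denied',
    'Mar  1 00:00:02 10.1.24.3 %LINK-3-UPDOWN: Interface down',
    'Mar  1 00:00:03 10.1.24.3 %LINK-3-UPDOWN: Interface up',
    'a line with no message',
]

overall, per_reporter_counts = count_messages(sample)
for msg, count in overall.most_common():
    print(count, msg)
for reporter in sorted(per_reporter_counts):
    print(reporter)
    for msg, count in per_reporter_counts[reporter].most_common():
        print(' ' * 2 + f'{count} {msg}')
```

The defaultdict creates an empty Counter the first time a reporter is seen, and the Counter treats an unseen message as zero, so the loop body needs no existence checks.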
If you got this far, you can go implement it for IPv6 addresses. :-)