Friday, June 20, 2014

Python Sets: Handy for Network Data

My Python-related posts seem to get the most reads, so here's another one!

A problem that comes up fairly often in networking is finding the number of occurrences of unique items in a large collection of data: let's say you want to find all of the unique IP addresses that accessed a website, traversed a firewall, got denied by an ACL, or whatever. Maybe you've extracted the following list from a log file:

1.1.1.1
2.2.2.2
3.3.3.3
1.1.1.1
5.5.5.5
5.5.5.5
1.1.1.1
2.2.2.2
...

and you need to reduce this to:

1.1.1.1
2.2.2.2
3.3.3.3
5.5.5.5

In other words, we're removing the duplicates. In low-level programming languages, removing duplicates is a bit of a pain: generally you need to implement an efficient way to sort an array of items, then traverse the sorted array to check for adjacent duplicates and remove them. In a language that has dictionaries (also known as hash tables or associative arrays), you can do it by adding each item as a key in your dictionary with an empty value, then extract the keys. In Python:

>>> items = ['1.1.1.1','2.2.2.2','3.3.3.3','1.1.1.1','5.5.5.5','5.5.5.5','1.1.1.1','2.2.2.2']
>>> d = {}
>>> for item in items:
...     d[item] = None
...
>>> d
{'5.5.5.5': None, '3.3.3.3': None, '1.1.1.1': None, '2.2.2.2': None}
>>> unique = d.keys()
>>> unique
['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2']


or, more concisely using a dictionary comprehension:

>>> {item:None for item in items}.keys()
['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2']


Python has an even better way, however: the "set" type, which emulates the mathematical idea of a set as a collection of distinct items. If you create an empty set and add items to it, duplicates will automatically be thrown away:

>>> s = set()
>>> s.add('1.1.1.1')
>>> s
set(['1.1.1.1'])
>>> s.add('2.2.2.2')
>>> s.add('1.1.1.1')
>>> s
set(['1.1.1.1', '2.2.2.2'])
>>> for item in items:
...     s.add(item)
...
>>> s
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Predictably, you can use set comprehensions just like list comprehensions to do the same thing as a one liner:

>>> {item for item in items}
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Or, if you have a list built already you can just convert it to a set:

>>> set(items)
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Python also provides methods for the most common types of set operations: union, intersection, difference and symmetric difference. Because these methods accept lists or other iterables, you can quickly find similarities between collections of items:

>>> items
['1.1.1.1', '2.2.2.2', '3.3.3.3', '1.1.1.1', '5.5.5.5', '5.5.5.5', '1.1.1.1', '2.2.2.2']
>>> more_items = ['1.1.1.1','8.8.8.8','1.1.1.1','7.7.7.7','2.2.2.2']
>>> set(items).intersection(more_items)
set(['1.1.1.1', '2.2.2.2'])

>>> set(items).difference(more_items)
set(['5.5.5.5', '3.3.3.3'])


Have fun!