Friday, June 20, 2014

Python Sets: Handy for Network Data

My Python-related posts seem to get the most reads, so here's another one!

A problem that comes up fairly often in networking is finding the number of occurrences of unique items in a large collection of data: let's say you want to find all of the unique IP addresses that accessed a website, traversed a firewall, got denied by an ACL, or whatever. Maybe you've extracted the following list from a log file:

1.1.1.1
2.2.2.2
3.3.3.3
1.1.1.1
5.5.5.5
5.5.5.5
1.1.1.1
2.2.2.2
...

and you need to reduce this to:

1.1.1.1
2.2.2.2
3.3.3.3
5.5.5.5

In other words, we're removing the duplicates. In low-level programming languages, removing duplicates is a bit of a pain: generally you need to implement an efficient way to sort an array of items, then traverse the sorted array to check for adjacent duplicates and remove them. In a language that has dictionaries (also known as hash tables or associative arrays), you can do it by adding each item as a key in your dictionary with an empty value, then extract the keys. In Python:

>>> items = ['1.1.1.1','2.2.2.2','3.3.3.3','1.1.1.1','5.5.5.5','5.5.5.5','1.1.1.1','2.2.2.2']
>>> d = {}
>>> for item in items:
...     d[item] = None
...
>>> d
{'5.5.5.5': None, '3.3.3.3': None, '1.1.1.1': None, '2.2.2.2': None}
>>> unique = d.keys()
>>> unique
['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2']


or, more concisely using a dictionary comprehension:

>>> {item:None for item in items}.keys()
['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2']


Python has an even better way, however: the "set" type, which emulates the mathematical idea of a set as a collection of distinct items. If you create an empty set and add items to it, duplicates will automatically be thrown away:

>>> s = set()
>>> s.add('1.1.1.1')
>>> s
set(['1.1.1.1'])
>>> s.add('2.2.2.2')
>>> s.add('1.1.1.1')
>>> s
set(['1.1.1.1', '2.2.2.2'])
>>> for item in items:
...     s.add(item)
...
>>> s
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Predictably, you can use set comprehensions just like list comprehensions to do the same thing as a one liner:

>>> {item for item in items}
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Or, if you have a list built already you can just convert it to a set:

>>> set(items)
set(['5.5.5.5', '3.3.3.3', '1.1.1.1', '2.2.2.2'])


Python also provides methods for the most common types of set operations: union, intersection, difference and symmetric difference. Because these methods accept lists or other iterables, you can quickly find similarities between collections of items:

>>> items
['1.1.1.1', '2.2.2.2', '3.3.3.3', '1.1.1.1', '5.5.5.5', '5.5.5.5', '1.1.1.1', '2.2.2.2']
>>> more_items = ['1.1.1.1','8.8.8.8','1.1.1.1','7.7.7.7','2.2.2.2']
>>> set(items).intersection(more_items)
set(['1.1.1.1', '2.2.2.2'])

>>> set(items).difference(more_items)
set(['5.5.5.5', '3.3.3.3'])


Have fun!

4 comments:

Jeremy Schulman said...

Hi Jay,

Another cool think I tend to use for similar purposes is the "collections.Counter". This will give you both a unique set of keys *and* a count of each item.

For example, I might want to gather all the items from a device inventory that has serial-numbers, and then get a listing and count.

>> from collections import Counter

In the snippet below, the variable "sn" is a Table object (created via the Junos PyEZ library). The "sn" variable is iterable (like a dictionary), so I can use it to build a list via compression and then pass that as the input to the Counter constructor:

>>> catalog = Counter([item.desc for item in sn])
>>>

You can dump the entire collection using the "items()" method:

>>> # pretty-print the catalog items
>>> pprint( catalog.items() )
[('XFP-10G-SR', 2),
('MX FPC Type 2', 1),
('MX SCB', 2),
('8x 1GE(LAN), IQ2', 1),
('RE-S-1800x4', 2),
('Front Panel Display', 1),
('SFP-T', 2),
('DPCE 40x 1GE R', 1),
('DPC PMB', 4),
('DPCE 40x 1GE R EQ', 1),
('SFP-SX', 26),
('PS 1.2-1.7kW; 100-240V AC in', 3),
('DPCE 20x 1GE + 2x 10GE R', 1),
('MX480', 1),
('MX480 Midplane', 1)]

Or access a specific item by name:

>>> catalog['SFP-SX']
26

Hope this helps!

Jay Swan said...

collections.Counter is awesome. You saved me from writing a post about it!

sets, collections and itertools are probably my favorite "secrets" from the standard library.

Kirk Byers said...

Good stuff Jay. Sets are definitely underrated especially when combined with the set operations.

I used a set difference yesterday. Two email lists and I wanted to only send to the individuals that were in list A that were not in list B. Sets made this much easier.

I also liked your dictionary comprehensions and set comprehensions.

Anonymous said...

If you want create unique list try:

In [1]: list(set(['123', '213', '123']))
Out[1]: ['123', '213']