How can I use NetFlow to track the websites being accessed from my network?
The short answer that I usually give on the forum is this: you can't, because NetFlow doesn't track HTTP headers. With this blog post, though, I'll go into the answer in more detail so that I can refer people to it in the future.
[Edited to add: I'm really talking about traditional router-based NetFlow here, typically version 5. See the end of the post and the comments for information on pcap-based tools and firewall-based tools that do HTTP header export.]
First, a quick review of what NetFlow is, and how it works:
- When NetFlow is enabled on a router interface, the router begins to track information about the traffic that transits the interface. This information is stored in a data structure called the flow cache.
- Periodically, the contents of the flow cache can be exported to a "collector", which is a process running on an external system that receives and stores flow data. This process is called "NetFlow Data Export", or NDE. Typically the collector is tied into an "analyzer", which massages the flow data into something useful for human network analysts.
- NDE is optional. One can gather useful information from NetFlow solely from the command-line without ever using an external collector.
- The data that NetFlow can track depends on the version. The most commonly deployed version today is NetFlow version 5, which tracks the following key fields:
- Source interface
- Source and destination IP address
- Layer 4 protocol (e.g., ICMP, TCP, UDP, OSPF, ESP, etc.)
- Source and destination port number (if the layer 4 protocol is TCP or UDP)
- Type of service value
- These "key fields" are used to define a "flow"; that is, a unidirectional conversation between a pair of hosts. Because flows are unidirectional, an important feature in NetFlow analysis software is the ability to pair the two sides of a flow to give a complete picture of the conversation.
- Other "non-key" fields are also tracked. In NetFlow version 5, the other fields are as follows. Note that not all collector software preserves all the fields.
- TCP flags (used by the router to determine the beginning and end of a TCP flow)
- Egress interface
- Packet and byte count for the flow
- BGP origin AS and peer AS
- IP next-hop
- Source and destination netmask
- NetFlow v9, Cisco Flexible NetFlow, and IPFIX (the IETF flow protocol, which is very similar to NetFlow v9) allow user-defined fields that can track any part of the packet headers.
- Many vendors have defined other flow protocols that offer more or fewer capabilities, but virtually all of them duplicate at least the functions of NetFlow v5.
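To make the "flow key" and "pairing" ideas concrete, here's a short Python sketch. The field names and packet values are made up for illustration; real analyzers work from exported records, not dicts, but the logic is the same: packets sharing the key fields belong to one unidirectional flow, and an analyzer pairs the two directions by computing a direction-independent conversation ID.

```python
def flow_key(pkt):
    """v5-style flow key: input interface, src/dst IP, protocol,
    src/dst port, and ToS. Packets sharing a key are one flow."""
    return (pkt["in_iface"], pkt["src_ip"], pkt["dst_ip"],
            pkt["proto"], pkt["src_port"], pkt["dst_port"], pkt["tos"])

def conversation_id(key):
    """Direction-independent ID an analyzer could use to pair the two
    unidirectional flows of a conversation. Sorting the (addr, port)
    endpoints makes A->B and B->A collapse to the same ID; interface
    and ToS are dropped because they can differ per direction."""
    _iface, src_ip, dst_ip, proto, sport, dport, _tos = key
    endpoints = tuple(sorted([(src_ip, sport), (dst_ip, dport)]))
    return (proto,) + endpoints

# Two packets from opposite directions of one TCP session (made-up values):
a = {"in_iface": 1, "src_ip": "10.0.0.5", "dst_ip": "192.0.2.8",
     "proto": 6, "src_port": 51515, "dst_port": 80, "tos": 0}
b = {"in_iface": 2, "src_ip": "192.0.2.8", "dst_ip": "10.0.0.5",
     "proto": 6, "src_port": 80, "dst_port": 51515, "tos": 0}
```

Here `flow_key(a)` and `flow_key(b)` differ (two flows), but both map to the same `conversation_id` (one conversation).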
Here's a sample v5 flow record, decoded from a packet capture of NetFlow export:
SrcAddr: 18.104.22.168 (22.214.171.124)
DstAddr: 10.118.218.102 (10.118.218.102)
NextHop: 0.0.0.0 (0.0.0.0)
[Duration: 1.388000000 seconds]
StartTime: 3422510.740000000 seconds
EndTime: 3422512.128000000 seconds
DstPort: 445 <-- probably a port scan for open Microsoft services
TCP Flags: 0x02
Protocol: 6 <-- this is the layer 4 protocol; i.e. TCP
IP ToS: 0x00
SrcAS: 4768 <-- this particular router is tracking BGP Origin-AS
SrcMask: 22 (prefix: 126.96.36.199/22)
DstMask: 30 (prefix: 10.118.218.100/30)
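For the curious, the v5 export format is simple enough to decode by hand: each export datagram carries a fixed 24-byte header followed by up to 30 fixed 48-byte records. Here's a minimal decoder sketch for the fields shown above; `parse_v5` is my own hypothetical helper, not part of any collector package, and a real collector would do far more validation.

```python
import struct
import socket

V5_HEADER = struct.Struct("!HHIIIIBBH")              # 24-byte export header
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBB2s")  # 48-byte flow record

def parse_v5(datagram):
    """Decode a NetFlow v5 export datagram into a list of record dicts."""
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_seq, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(datagram, 0)
    if version != 5:
        raise ValueError("not a NetFlow v5 packet")
    records = []
    for i in range(count):
        (src, dst, nexthop, in_if, out_if, pkts, octets, first, last,
         sport, dport, _pad1, tcp_flags, proto, tos,
         src_as, dst_as, src_mask, dst_mask, _pad2) = V5_RECORD.unpack_from(
            datagram, V5_HEADER.size + i * V5_RECORD.size)
        records.append({
            "SrcAddr": socket.inet_ntoa(src), "DstAddr": socket.inet_ntoa(dst),
            "NextHop": socket.inet_ntoa(nexthop),
            "SrcPort": sport, "DstPort": dport,
            "Protocol": proto, "TCPFlags": tcp_flags, "ToS": tos,
            "Packets": pkts, "Octets": octets,
            "SrcAS": src_as, "DstAS": dst_as,
            "SrcMask": src_mask, "DstMask": dst_mask,
        })
    return records
```

Feed it the payload of each UDP datagram a collector receives (port 2055 is a common NDE choice) and you get back dicts with the same fields shown in the record above. Note that nothing in the 48-byte record has anywhere to put an HTTP header.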
NetFlow isn't a good web usage tracker because nowhere in the list of fields above do we see "HTTP header". [see note 1] The HTTP header is the part of the application layer payload that actually specifies the website and URL that's being requested. Here's a sample from another packet capture:
GET / HTTP/1.1
User-Agent: curl/7.21.6 (i686-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/188.8.131.52 libidn/1.22 librtmp/2.3
Host: www.ubuntu.com
This is the request sent by the HTTP client (in this case the "curl" command-line HTTP utility) when accessing http://www.ubuntu.com. The request line "GET / HTTP/1.1" asks for the root ("/") of the website named in the "Host:" field; i.e., www.ubuntu.com.
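To see exactly where that hostname lives, here's a sketch that builds such a request by hand (`build_http_request` is a hypothetical helper, not curl's actual code). The website's name exists only inside the TCP payload; the IP and TCP headers, which are all NetFlow summarizes, never carry it:

```python
def build_http_request(host, path="/", user_agent="example-client/1.0"):
    """Build a minimal HTTP/1.1 request.

    The target website's name appears only here, in the application-layer
    payload -- the IP packet itself carries just addresses and ports,
    which is all NetFlow ever sees.
    """
    return (f"GET {path} HTTP/1.1\r\n"
            f"User-Agent: {user_agent}\r\n"
            f"Host: {host}\r\n"
            f"Accept: */*\r\n"
            "\r\n").encode("ascii")

# Two different sites can live at the same IP; only the Host header differs:
req1 = build_http_request("www.ubuntu.com")
req2 = build_http_request("www.canonical.com")
```

Send `req1` and `req2` to the same server address and you reach two different websites, yet at the flow level the two conversations look identical.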
The IP address used in this request was 184.108.40.206. However, if we do a reverse lookup on this address, the record returned is different:
$ dig -x 220.127.116.11 +short
A little search-engine-fu shows that several other websites are hosted at the same IP address.
If we do the same trick with other websites (like unroutable.blogspot.com, hosted by Google), we can easily find cases in which there are dozens of websites hosted at the same IP address.
Because NetFlow doesn't extract the HTTP header from TCP flows, we have only the IP address to go on. As we've seen here, many different websites can be hosted at the same IP address; there's no way to tell just from NetFlow whether a user visited www.canonical.com or www.ubuntu.com. Furthermore, with the most popular sites hosted on content distribution caches or cloud service providers, the reverse DNS lookups for high-bandwidth port 80 flows frequently resolve to names in networks like Akamai, Limelight, Google, Amazon Web Services, Rackspace, etc., even if those content distribution networks have nothing to do with the content of the actual website that was visited.
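Another way to put it: virtual hosting makes hostname-to-IP a many-to-one mapping, so inverting it from flow data yields a set of candidates rather than an answer. A toy model (the hostnames and the address below are made up for illustration):

```python
# Hypothetical virtual-hosting table: many names, one address.
VHOSTS = {
    "www.ubuntu.com":    "192.0.2.10",
    "www.canonical.com": "192.0.2.10",
    "shop.example.com":  "192.0.2.10",
}

def sites_at(ip, table):
    """Invert the hostname->IP mapping: all names served from `ip`."""
    return sorted(name for name, addr in table.items() if addr == ip)

# A NetFlow record for this server tells us only the IP address -- any
# of these sites could be the one the user actually visited:
candidates = sites_at("192.0.2.10", VHOSTS)
```

With a large CDN or cloud provider in place of this three-entry table, the candidate set can run to thousands of unrelated sites.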
The bottom line is this: if you want to track what websites are visited by users on a network, NetFlow isn't the best tool, or even a good one. A web proxy (e.g., Squid) or a web content filter (e.g., Websense, Cisco WSA, etc.) is probably the best tool, since they track not only HTTP host headers but also (usually) the Active Directory username associated with the request.
Other tools that could do the job are security-related tools like httpry or Bro-IDS, both of which have features for HTTP request tracking. These tools are both available in the excellent Security Onion Linux distribution.
[Edited to add] The anonymous commenter below observes that nProbe exports HTTP header information via IPFIX, and notes that some vendors have firewalls that do so as well. nProbe is an excellent free tool that takes a raw packet stream and converts it to NetFlow or IPFIX export format.
[Note 1]: There is a feature in Cisco's Flexible NetFlow that allows a device to export a fixed-length slice of the raw packet data. I suppose this could be used to implement primitive HTTP header extraction in the analysis software, but it would be an inefficient way of doing so, since it requires copying large parts of the packet during NetFlow export. The documentation warns about the performance implications of using the feature. I'm not aware of any collector/analyzer packages that make use of this feature from IOS; it would be much more efficient to use a tap or a mirrored port in conjunction with purpose-built tools like those mentioned above. See the comment below for information on how Plixer does this in conjunction with nProbe.