Understanding Traceroute

By Greg Gardner
greg@bah.org

Traceroute is a very handy tool written by Van Jacobson that can show you the route that packets take from one host to another. It can also be used sometimes to help debug network problems, if you know how to interpret its results.

How Does it Work?

Here, in simplified form, is how traceroute works. Every IP packet carries a field, called the Time To Live (TTL), that specifies how many hops it can go through before it is no longer forwarded on. Each router that handles the packet decrements that value by one, and when it reaches zero the router just forgets all about the packet, but it will usually also send a message back to the source host saying, "Hey, sorry, but your packet died here."

Traceroute cleverly manipulates this value so that the first round of packets it sends toward the designated host can only go through one hop before dying. The first hop gets those packets, sees that it's not supposed to forward them any further, and sends a message back to the source host telling it that the packets died. When traceroute receives the "your packets died here" message from that router, it knows that's the first hop. It then sends a second round of packets that can go through TWO hops, and the cycle continues until it gets a response from the final destination. For each hop, traceroute displays the RTT (Round Trip Time): the time between when each probe was sent and when the response to it arrived.
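The cycle described above can be sketched as a small simulation. This is only an illustration, not how the real tool is implemented: the function name simulate_traceroute and the list-of-hostnames "path" are made up for the example, and no packets are actually sent. The hop limit is the IP Time To Live (TTL) field, which each router decrements by one.

```python
def simulate_traceroute(path, max_hops=30):
    """Simulate which host answers each round of probes.

    `path` is a hypothetical ordered list of hostnames: the routers
    along the way, followed by the destination as the last entry.
    """
    discovered = []
    for ttl in range(1, max_hops + 1):
        remaining = ttl
        for position, host in enumerate(path):
            if position == len(path) - 1:
                # The probe reached the destination itself, so we are done.
                discovered.append((ttl, host, "reached destination"))
                return discovered
            # A router decrements the hop limit before forwarding.
            remaining -= 1
            if remaining == 0:
                # The router drops the probe and reports back to the source.
                discovered.append((ttl, host, "packet died here"))
                break
    return discovered


# Shortened, illustrative path built from hostnames in this article.
example_path = ["SF-rt2-f2.geo.net", "SF-core1-f0.geo.net", "amber.Berkeley.EDU"]
for ttl, host, event in simulate_traceroute(example_path):
    print(ttl, host, event)
```

Each round raises the starting hop limit by one, so each round's reply comes from one hop further along the path.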

Let's take a look at an example traceroute:


traceroute to amber.Berkeley.EDU (128.32.25.12), 30 hops max, 40 byte packets
 1  SF-rt2-f2.geo.net (166.90.2.13)  1.671 ms  1.02 ms  1.047 ms
 2  SF-core1-f0.geo.net (166.90.5.4)  0.753 ms  1.606 ms  0.626 ms
 3  MAE-West-h0.geo.net (166.90.1.34)  3.577 ms  3.79 ms  4.032 ms
 4  sl-mae-w-F0/0.sprintlink.net (198.32.136.11)  5.437 ms  7.123 ms  3.89 ms
 5  sl-bb2-stk-4-0.sprintlink.net (144.228.10.109)  7.773 ms  8.094 ms  8.434 ms
 6  sl-bb11-stk-4-2-155M.sprintlink.net (144.232.4.69)  9.086 ms  10.504 ms  8.171 ms
 7  sl-gw10-stk-8-0-0-155M.sprintlink.net (144.232.4.97)  8.336 ms  9.565 ms  8.434 ms
 8  sl-ucberkeley-1-1-0-T3.sprintlink.net (144.228.146.50)  12.227 ms  10.739 ms  11.901 ms
 9  f5-0.inr-666-eva.berkeley.edu (198.128.16.21)  22.43 ms  12.607 ms  12.243 ms
10  f1-0-0.inr-107-eva.Berkeley.EDU (128.32.2.1)  9.479 ms  15.837 ms  11.53 ms
11  f8-0.inr-100-eva.Berkeley.EDU (128.32.235.100)  11.978 ms  12.495 ms  10.85 ms
12  amber.Berkeley.EDU (128.32.25.12)  13.068 ms  12.883 ms  10.088 ms

As you can see, there are 12 hops from the geo.net web server to the UC Berkeley web server (amber.Berkeley.EDU), and the Round Trip Time from us to it appears to be roughly 10-13 ms (based on the 3 numbers on the last line: 13.068 ms, 12.883 ms and 10.088 ms). Keep in mind that the RTTs reported are the round trip times from the source host to THAT router hop. They are not a cumulative sum of the previous times or anything like that. Each hop adds some time to the path, so you'd expect each hop to take a little more time to reach than the last. Looking at this example, you can see that this is pretty much the case, except for slight fluctuations on the order of milliseconds due to network traffic.

Now an important thing to know when using traceroute is what the asterisks/stars mean. If you see traceroute print out a star instead of a round trip time, that means that either your probe packet got dropped, or the reply back to you for that probe got lost along the way. This is usually referred to as "packet loss," and we will discuss this later.
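If you want to pull the round trip times and stars out of traceroute output with a script, something like the following rough sketch may help. It assumes the classic one-line-per-hop layout shown in the example above (hop number, hostname, address in parentheses, then times or stars); the function name parse_hop and the dictionary keys are invented for this example, and other traceroute versions may format their output differently.

```python
def parse_hop(line):
    """Parse one hop line of classic traceroute output.

    Returns None for lines that do not start with a hop number,
    such as the "traceroute to ..." header.
    """
    tokens = line.split()
    if not tokens or not tokens[0].isdigit():
        return None
    hop = {"hop": int(tokens[0]), "host": None, "rtts_ms": [], "lost": 0}
    # Walk over neighbouring token pairs so each token can see its predecessor.
    for previous, token in zip(tokens, tokens[1:]):
        if token == "*":
            hop["lost"] += 1                       # probe or reply went missing
        elif token == "ms":
            hop["rtts_ms"].append(float(previous))  # number just before "ms"
        elif token.startswith("(") and token.endswith(")"):
            hop["host"] = previous                 # name just before "(address)"
    return hop


print(parse_hop("12  amber.Berkeley.EDU (128.32.25.12)  13.068 ms  12.883 ms  10.088 ms"))
print(parse_hop(" 4  *  *  *"))
```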

Caveats and Quirks

Before we continue, there are a couple of little caveats to using traceroute that you should be aware of, so you don't accidentally misinterpret the results.

The first caveat to be aware of is that sometimes it will look like the last hop on a traceroute dropped a packet, when it really didn't. This is due both to the fact that this host is the actual final destination of your traceroute probes, and to how certain operating systems handle ICMP. (ICMP, Internet Control Message Protocol, is one protocol that machines on the Internet use to send messages to each other, and the "Your packet died here" message that traceroute relies on is an ICMP message.) Since the last hop is your destination, instead of that host sending you back an ICMP message saying "Sorry your packet died here," that host will send back a different ICMP message saying "Hi, your packet made it here, but this port is unreachable." This is because traceroute purposefully sets the probe packet's destination to be some large port number that will most likely be unreachable at the destination host, because it wants to receive that "port unreachable" message back. The caveat here is that some operating systems, such as IOS (which Cisco routers run) and Sun Solaris, purposefully drop ICMP responses like "port unreachable" if they have to generate too many of them in a short period of time. They do this presumably as a security precaution. So, if you were to add more delay between probes, you wouldn't see this erroneous packet loss.
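The two kinds of reply described above can be told apart by their ICMP type and code. The numbers below are the standard ICMP values (type 11 is "time exceeded", type 3 code 3 is "destination unreachable, port unreachable"); the function classify_reply and its return strings are just a hypothetical sketch of the decision traceroute has to make, not code from the tool itself.

```python
ICMP_TIME_EXCEEDED = 11          # "your packet died here"
ICMP_DEST_UNREACHABLE = 3        # family of "could not deliver" messages
ICMP_CODE_PORT_UNREACHABLE = 3   # "made it here, but this port is unreachable"


def classify_reply(icmp_type, icmp_code):
    """Classify the reply to one probe; pass None, None if nothing came back."""
    if icmp_type is None:
        return "no reply (shown as *)"
    if icmp_type == ICMP_TIME_EXCEEDED:
        return "intermediate hop"
    if icmp_type == ICMP_DEST_UNREACHABLE and icmp_code == ICMP_CODE_PORT_UNREACHABLE:
        return "final destination"
    return "other ICMP message"


print(classify_reply(11, 0))
print(classify_reply(3, 3))
print(classify_reply(None, None))
```

A destination that rate-limits its "port unreachable" replies falls into the first branch for some probes, which is why it can show stars even though it is reachable.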

Another caveat of traceroute is that ICMP, which is the protocol traceroute relies on to get responses from each hop, is usually treated as the lowest priority traffic. So if a router is really busy it might decide to drop ICMP messages, and you will see lots of packet loss at that hop, even though the router is forwarding more common, higher priority traffic just fine.

Also, some sites will filter ICMP for various reasons, so it might appear in a traceroute that a site might be unreachable, but in fact it is reachable.

Tracking Down Network Problems

So now that you have a basic understanding of traceroute, it's time to learn how to use it to track down network problems. The first kind of network problem that traceroute can help you debug is a loss or lack of connectivity to a site. If you appear to be having problems reaching a remote site, like a web site, do a traceroute to that site. If the traceroute reaches that site fine, then chances are that you have connectivity to the host, but that the web server on that host crashed. But if the packets start to die somewhere along the path, it's likely that some router along the way, or the host itself, is down. Here is an example traceroute:


traceroute to 209.0.0.210 (209.0.0.210): 1-30 hops, 38 byte packets
 1  SF-rt5-fe9-0.geo.net (166.90.6.1)  0.48 ms  0.440 ms  0.378 ms
 2  SF-core1-h1.geo.net (166.90.1.17)  0.618 ms  0.571 ms  0.521 ms
 3  SF-rt2-f0.geo.net (166.90.5.7)  1.19 ms  1.94 ms  1.13 ms
 4  *  *  *
 5  *  *  *

Just remember that such a traceroute can also be an example of a firewall that is filtering packets, or a router that throws away the kinds of packets that traceroute depends on when it gets overloaded.

Debugging Network Slowdowns

Using traceroute's results to see what hops IP packets take from you to a remote host is really straightforward. However, using traceroute's results to debug where "slowness" occurs in a link is fairly tricky for a number of different reasons. The first is that traceroute only shows you the hops from you to a remote host, not the hops from the remote host back to you, and the two are often different. Pretty much every top-level ISP on the Internet uses closest-exit routing, which often results in asymmetric routes (completely different routes from host A to B than from host B to A). So, the best way to determine where network slowness is occurring is to do a traceroute from host A to host B, and then another traceroute from host B back to host A. By looking at both, a trained eye can usually get a pretty good idea of where the network slowness is occurring.

For instance, host A might be on the west coast using ISP X, and host B might be on the east coast using ISP Y. The path from host A to host B will then probably exit ISP X as soon as it can, most likely at some peering point on the west coast and enter ISP Y's network from there onto host B. Conversely, the path from host B to host A will most likely exit ISP Y's network as soon as it can on the east coast, and enter ISP X's network and continue on to host A.

Here's an example:


traceroute to web-proxy.geo.net (166.90.90.163)
 1  E40-RTR-E40-SERVER72-ETHER.MIT.EDU (18.72.0.1)  4 ms  4 ms  4 ms
 2  EXTERNAL-RTR-FDDI.MIT.EDU (18.168.0.12)  4 ms  4 ms  4 ms
 3  cambridge2-br2.bbnplanet.net (192.233.33.6)  4 ms  4 ms  4 ms
 4  cambridge1-br1.bbnplanet.net (4.0.2.25)  4 ms  78 ms  105 ms
 5  nyc1-br2.bbnplanet.net (4.0.2.85)  12 ms  12 ms  12 ms
 6  nynap.bbnplanet.net (4.0.1.26)  12 ms  12 ms  16 ms
 7  sprint-nap.geo.net (192.157.69.43)  94 ms  82 ms  74 ms
 8  SF-rt5-a1.geo.net (166.90.4.33)  70 ms  78 ms  74 ms
 9  SF-core1-h1.geo.net (166.90.1.17)  82 ms  78 ms  74 ms
10  SF-rt2-f0.geo.net (166.90.5.7)  273 ms  234 ms  98 ms
11  web-proxy.geo.net (166.90.90.163)  133 ms  90 ms  82 ms

traceroute to BIG-SCREW.MIT.EDU (18.72.0.176), 30 hops max, 40 byte packets
 1  SF-rt2-f2.geo.net (166.90.2.13)  1.218 ms  1.219 ms  1.479 ms
 2  SF-core1-f0.geo.net (166.90.5.4)  0.704 ms  0.68 ms  0.678 ms
 3  MAE-West-h0.geo.net (166.90.1.34)  3.926 ms  3.402 ms  4.285 ms
 4  sanjose1-br1.bbnplanet.net (198.32.184.19)  5.071 ms  4.839 ms  6.973 ms
 5  su-bfr.bbnplanet.net (4.0.1.10)  6.695 ms  6 ms  8.342 ms
 6  chicago1-br2.bbnplanet.net (4.0.3.165)  71.597 ms  70.278 ms  70.166 ms
 7  boston1-br1.bbnplanet.net (4.0.2.245)  76.612 ms 74.881 ms 75.66 ms
 8  boston1-br2.bbnplanet.net (4.0.2.250)  74.099 ms  77.012 ms  76.715 ms
 9  cambridge2-br1.bbnplanet.net (4.0.1.186)  75.399 ms  75.376 ms  74.932 ms
10  ihtfp.mit.edu (192.233.33.3)  78.895 ms  76.066 ms  76.434 ms
11  E40-RTR-FDDI.MIT.EDU (18.168.0.11)  77.556 ms  76.115 ms  75.627 ms
12  BIG-SCREW.MIT.EDU (18.72.0.176)  76.484 ms  76.226 ms  77.748 ms

Note the vastly different paths that these two traceroutes take from host A to host B and from host B to host A, each with a different number of hops. The first traceroute shows that the path from MIT to geo.net goes through the Sprint NAP, an exchange point in New Jersey. This makes sense, since MIT is on the east coast and BBN is using closest-exit routing. The second traceroute shows that the path from geo.net in San Francisco back to MIT goes through MAE West, an exchange point in the San Francisco Bay Area, the closest exit point for geo.net.
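One way to make the exit points in the two traceroutes above stand out is to look for the places where the hostnames change from one network to another. The sketch below does that with the hostnames from those two traces. It makes a crude assumption, that the last two labels of a hostname identify the network (true for geo.net, bbnplanet.net and mit.edu here, but not in general), and the function names network_of and handoffs are made up for the example.

```python
def network_of(hostname):
    """Guess the network from the last two labels, e.g. 'geo.net' (an assumption)."""
    return ".".join(hostname.lower().split(".")[-2:])


def handoffs(hostnames):
    """Return (from_host, to_host) pairs where the path moves to another network."""
    pairs = []
    for here, there in zip(hostnames, hostnames[1:]):
        if network_of(here) != network_of(there):
            pairs.append((here, there))
    return pairs


MIT_TO_GEO = [
    "E40-RTR-E40-SERVER72-ETHER.MIT.EDU", "EXTERNAL-RTR-FDDI.MIT.EDU",
    "cambridge2-br2.bbnplanet.net", "cambridge1-br1.bbnplanet.net",
    "nyc1-br2.bbnplanet.net", "nynap.bbnplanet.net",
    "sprint-nap.geo.net", "SF-rt5-a1.geo.net", "SF-core1-h1.geo.net",
    "SF-rt2-f0.geo.net", "web-proxy.geo.net",
]
GEO_TO_MIT = [
    "SF-rt2-f2.geo.net", "SF-core1-f0.geo.net", "MAE-West-h0.geo.net",
    "sanjose1-br1.bbnplanet.net", "su-bfr.bbnplanet.net",
    "chicago1-br2.bbnplanet.net", "boston1-br1.bbnplanet.net",
    "boston1-br2.bbnplanet.net", "cambridge2-br1.bbnplanet.net",
    "ihtfp.mit.edu", "E40-RTR-FDDI.MIT.EDU", "BIG-SCREW.MIT.EDU",
]

for label, trace in (("MIT to geo.net", MIT_TO_GEO), ("geo.net to MIT", GEO_TO_MIT)):
    for here, there in handoffs(trace):
        print(label, ":", here, "->", there)
```

In one direction the hand-off between BBN and geo.net shows up at the New York end of BBN's network, and in the other direction it shows up at MAE West, which is the asymmetry that closest-exit routing produces.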

Now, to make the issue more confusing, the second reason why tracking down network "slowness" is tricky is that in networking there is no "slow" or "fast". Instead there are bandwidth and latency, two different concepts that both determine how "fast" a network is. (If you are unclear on the difference between bandwidth and latency, check out a cool paper written by Stuart Cheshire called "It's the Latency, Stupid". It's a little technical, but don't be scared off by that, because it's good reading. Or, for the Cliffs Notes version, read my paper on Bandwidth vs. Latency.)

Tracking Down Packet Loss

So now we know that bandwidth is how many packets you can stuff in your pipe and that latency is the delay, and that packet loss can adversely affect both. So, in general, when trying to track down network "slowness", you should be looking for packet loss. But this can get kind of tricky because packet loss is random. So, you might actually be getting packet loss at hop #2, but with the default 3 probes per hop, maybe all 3 will get back OK. Then at later hops you will start noticing the packet loss that really occurs at hop #2, but it might look like it's occurring at hop #3. So, it's usually better to do more than 3 probes per hop. Eric Wassenaar wrote a more advanced traceroute which has a lot more options, including setting how many probes you want per hop, and printing a summary of min/avg/max round trip time as well as percentage loss at each hop. This version of traceroute is available at ftp://ftp.nikhef.nl/pub/network/traceroute.tar.Z. We run this traceroute through a web interface at http://www.noc.geo.net/cgi-bin/nph-trace-plus if you don't want to download the source and compile it. Here is an example traceroute from it with 20 probes per hop:

traceroute to www.uu.net (199.170.0.30): 1-30 hops, 38 byte packets
 1  SF-rt2-f2.geo.net (166.90.2.13) [AS3356 - GeoNet Communications, Inc.]  1.56/1.94/3.34 (0.418) ms  20/20 pkts (0% loss)
 2  SF-core1-f0.geo.net (166.90.5.4) [AS3356 - GeoNet Communications, Inc.]  1.27/2.4/10.3 (1.92) ms  20/20 pkts (0% loss)
 3  MAE-West-h0.geo.net (166.90.1.34) [AS3356 - GeoNet Communications, Inc.]  3.99/5.12/10.2 (1.29) ms  20/20 pkts (0% loss)
 4  198.32.136.42 (198.32.136.42) [AS701 - AlterNet route - AS 701]  4.75/6.96/12.4 (2.7) ms  20/20 pkts (0% loss)
 5  118.ATM11-0-0.XR1.SJC1.ALTER.NET (146.188.144.138) [AS702 - UUNET-NET]  5.2/6.35/9.95 (1.9) ms  20/20 pkts (0% loss)
 6  193.ATM2-0-0.XR1.SCL1.ALTER.NET (146.188.144.145) [AS702 - UUNET-NET]  7.49/14.7/68.8 (13.1) ms  20/20 pkts (0% loss)
 7  100.ATM2-0-0.TR1.SCL1.ALTER.NET (146.188.145.226) [AS702 - UUNET-NET]  8.47/9.95/16.7 (1.78) ms  20/20 pkts (0% loss)
 8  107.ATM8-0-0.TR1.DCA1.ALTER.NET (137.39.104.2) [AS701 - UUNET]  69.4/72.5/81.2 (3.0) ms  20/20 pkts (0% loss)
 9  100.ATM5-0-0.XR1.DCA1.ALTER.NET (146.188.161.53) [AS702 - UUNET-NET]  71.1/86.7/280 (45.8) ms  20/20 pkts (0% loss)
10  195.ATM8-0-0.XR1.TCO1.ALTER.NET (146.188.160.106) [AS702 - UUNET-NET]  70.9/73.8/82.7 (2.58) ms  20/20 pkts (0% loss)
11  193.ATM5-0-0.GW2.FFX1.ALTER.NET (146.188.160.209) [AS702 - UUNET-NET]  69.1/73.9/77.9 (2.71) ms  20/20 pkts (0% loss)
12  UUNET7-GW.UU.NET (137.39.12.162) [AS701 - UUNET]  72.9/75.3/87.0 (3.4) ms  20/20 pkts (0% loss)
13  www.uu.net (199.170.0.30) [AS701 - UUNET Technologies, Inc.] 71.3/74.2/78.6 (1.83) ms  20/20 pkts (0% loss)
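To see why more probes per hop help, and what goes into a min/avg/max and loss summary like the one above, here is a rough sketch. It assumes each probe is lost independently with the same probability, which real networks don't guarantee, and the function names summarize and chance_all_probes_survive are invented for this example. Under that assumption, a link losing 10% of packets lets all 3 default probes through about 73% of the time, but lets all 20 probes through only about 12% of the time.

```python
def summarize(rtts_ms):
    """Summarize one hop; rtts_ms holds one entry per probe, None if it was lost."""
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    summary = {
        "sent": sent,
        "received": len(answered),
        "loss_pct": 100.0 * (sent - len(answered)) / sent if sent else 0.0,
        "min": None, "avg": None, "max": None,
    }
    if answered:
        summary["min"] = min(answered)
        summary["avg"] = sum(answered) / len(answered)
        summary["max"] = max(answered)
    return summary


def chance_all_probes_survive(loss_rate, probes):
    """Probability that every probe gets through, assuming independent losses."""
    return (1.0 - loss_rate) ** probes


print(summarize([73.1, None, 97.4]))
print(chance_all_probes_survive(0.10, 3))
print(chance_all_probes_survive(0.10, 20))
```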

So this route looks pretty good. Let's try to debug a bad one. So as not to make any specific ISP look bad, some hostnames and IP addresses have been changed to protect the innocent. Let's say you're connected to GeoNet via a T1, and you have another office in Chicago that is connected via a different ISP. One day you notice some definite slowness in transferring files and/or logging into machines at the remote site and you want to see where the problem lies. So you decide to do some traceroutes. A traceroute from your GeoNet connected office shows you:

traceroute to chicago4.mycompany.com (342.5.133.4): 1-30 hops, 38 byte packets
 1  router.SanFrancisco.mycompany.com (209.0.0.278)  3.52 ms  2.75 ms  2.63 ms
 2  some_interconnect.geo.net (166.90.420.231)  71.5 ms  3.71 ms  3.5 ms
 3  SF-core1-h1.geo.net (166.90.1.17)  3.23 ms  3.20 ms  3.25 ms
 4  MAE-West-h0.geo.net (166.90.1.34)  7.30 ms  13.7 ms  6.33 ms
 5  mae-west.other-isp.net (198.32.136.256)  21.0 ms  31.4 ms  29.7 ms
 6  core2.SanFrancisco.other-isp.net (254.70.100.245)  21.44 ms  32.2 ms  32.5 ms
 7  core1.Denver.other-isp.net (254.70.40.229)  73.1 ms  *  97.4 ms
 8  border3.Chicaco.other-isp.net (254.70.56.23)  62.3 ms  86.23 ms 53.88 ms
 9  my-company-t1.Chicago.other-isp.net (254.70.111.34)  120.43 ms  95.3 ms  86.44 ms
10  router.Chicago.other-isp.net (342.5.133.1)  *  *  112.42 ms
11  chicago4.mycompany.com (342.5.133.4)  132.34 ms  *  104.12 ms

So looking at this traceroute, you can see that there is some packet loss, but it's hard to tell exactly where it starts. It could be the link between hops 6 and 7, where the first star shows up, but it's hard to know for sure. So, being an educated tracerouter, you decide to do a traceroute from Chicago back to your office in San Francisco. You get:

traceroute to sf13.mycompany.com (209.0.0.267): 1-30 hops, 38 byte packets
 1  router.Chicago.other-isp.net (342.5.133.1)  3.85 ms  2.64 ms  4.15 ms
 2  my-company-t1.Chicago.other-isp.net (254.70.111.33)  5.16 ms  3.94 ms  7.22 ms
 3  border3.Chicaco.other-isp.net (254.70.56.23)  3.62 ms  4.28 ms  5.15 ms
 4  core1.Denver.other-isp.net (254.70.40.229)  25.8 ms  27.2 ms  23.7 ms
 5  core2.SanFrancisco.other-isp.net (254.70.100.245)  141.0 ms  *  49.7 ms
 6  pb-nap.geo.net (198.32.128.24)  123.43 ms  *  76.22 ms
 7  SF-rt3-f0.geo.net (166.90.354.7)  *  94.12 ms  102.32 ms
 8  some_interconnect.geo.net (166.90.420.232)  85.24 ms  *  97.3 ms
 9  sf13.mycompany.com (209.0.0.267)  117.31 ms  234.42 ms  99.19 ms

So now you have more to go on. First of all you see that this route is an asymmetric one. The first route is 11 hops and the route back is 9 hops. Now the number of hops doesn't make any significant difference in how fast your connection is, but an asymmetric route can make things like packet loss and latency increases appear to occur between two hops when the problem really isn't there. This is because the packet loss or increase in latency might be between two hops you don't even see, because the route back to you is completely different.

So now you can make an educated guess as to where the packet loss might be occurring. Based on the first traceroute, it looked like the bad link might be between core2.SanFrancisco.other-isp.net and core1.Denver.other-isp.net, and by looking at the route back in the other direction, it appears that this assumption might be correct. At this point, your best bet is to copy and paste your traceroutes and send them to your ISP. Armed with this kind of information, your ISP will have a much better chance of tracking down the problem than if you just sent them an email saying "my connection to my Chicago office is slow." It also gives you a better understanding of how traffic is exchanged on the Internet.
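The reasoning above, finding where stars first appear in each direction and checking whether both traces point at the same link, can be written down as a short sketch. The hostnames and lost-probe counts are copied from the two example traceroutes; the function name first_lossy_link is made up, and "first hop with a star" is only a rule of thumb given the caveats discussed earlier, not a reliable diagnosis.

```python
def first_lossy_link(hops):
    """hops: list of (hostname, lost_probes) in path order.

    Returns (previous_host, host) for the first hop showing loss, or None.
    """
    for index, (host, lost) in enumerate(hops):
        if lost > 0:
            previous = hops[index - 1][0] if index > 0 else None
            return (previous, host)
    return None


SF_TO_CHICAGO = [
    ("router.SanFrancisco.mycompany.com", 0), ("some_interconnect.geo.net", 0),
    ("SF-core1-h1.geo.net", 0), ("MAE-West-h0.geo.net", 0),
    ("mae-west.other-isp.net", 0), ("core2.SanFrancisco.other-isp.net", 0),
    ("core1.Denver.other-isp.net", 1), ("border3.Chicaco.other-isp.net", 0),
    ("my-company-t1.Chicago.other-isp.net", 0),
    ("router.Chicago.other-isp.net", 2), ("chicago4.mycompany.com", 1),
]
CHICAGO_TO_SF = [
    ("router.Chicago.other-isp.net", 0), ("my-company-t1.Chicago.other-isp.net", 0),
    ("border3.Chicaco.other-isp.net", 0), ("core1.Denver.other-isp.net", 0),
    ("core2.SanFrancisco.other-isp.net", 1), ("pb-nap.geo.net", 1),
    ("SF-rt3-f0.geo.net", 1), ("some_interconnect.geo.net", 1),
    ("sf13.mycompany.com", 0),
]

forward = first_lossy_link(SF_TO_CHICAGO)
reverse = first_lossy_link(CHICAGO_TO_SF)
print("forward:", forward)
print("reverse:", reverse)
# Same two routers in both directions suggests the same link is suspect.
print("same link:", set(forward) == set(reverse))
```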

Summary

In summary, traceroute is a network diagnostic tool that will show you the hops your Internet traffic takes from your host to a remote location. It will also tell you the round trip time from your host to each hop, as well as whether packets get lost along the way, which can be useful in tracking down network problems. Since routes on the Internet are often asymmetric, it's usually a good idea to do traceroutes in both directions if possible when trying to debug network slowness. In doing so, you can provide your ISP with crucial information that can help them fix the network problem.