Traceroute is a very handy tool written by Van Jacobson that can show you the route that packets take from one host to another. It can also be used sometimes to help debug network problems, if you know how to interpret its results.
Let's take a look at an example traceroute:
traceroute to amber.Berkeley.EDU (128.32.25.12), 30 hops max, 40 byte packets
 1 SF-rt2-f2.geo.net (166.90.2.13) 1.671 ms 1.02 ms 1.047 ms
 2 SF-core1-f0.geo.net (166.90.5.4) 0.753 ms 1.606 ms 0.626 ms
 3 MAE-West-h0.geo.net (166.90.1.34) 3.577 ms 3.79 ms 4.032 ms
 4 sl-mae-w-F0/0.sprintlink.net (198.32.136.11) 5.437 ms 7.123 ms 3.89 ms
 5 sl-bb2-stk-4-0.sprintlink.net (144.228.10.109) 7.773 ms 8.094 ms 8.434 ms
 6 sl-bb11-stk-4-2-155M.sprintlink.net (144.232.4.69) 9.086 ms 10.504 ms 8.171 ms
 7 sl-gw10-stk-8-0-0-155M.sprintlink.net (144.232.4.97) 8.336 ms 9.565 ms 8.434 ms
 8 sl-ucberkeley-1-1-0-T3.sprintlink.net (144.228.146.50) 12.227 ms 10.739 ms 11.901 ms
 9 f5-0.inr-666-eva.berkeley.edu (198.128.16.21) 22.43 ms 12.607 ms 12.243 ms
10 f1-0-0.inr-107-eva.Berkeley.EDU (128.32.2.1) 9.479 ms 15.837 ms 11.53 ms
11 f8-0.inr-100-eva.Berkeley.EDU (128.32.235.100) 11.978 ms 12.495 ms 10.85 ms
12 amber.Berkeley.EDU (128.32.25.12) 13.068 ms 12.883 ms 10.088 ms
As you can see, there are 12 hops from the geo.net web server to the UC Berkeley web server (amber.Berkeley.EDU), and the round trip time from us to it appears to be roughly 10-13 ms (based on the three numbers on the last line: 13.068 ms, 12.883 ms, 10.088 ms). Keep in mind that the RTTs reported are the round trip times from the source host to THAT router hop, not a cumulative sum of the previous times. Each hop adds some time to the path, so you'd expect each hop to take a little more time to reach than the last. Looking at this example, you can see that this is pretty much the case, except for slight fluctuations on the order of milliseconds due to network traffic.
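If you want to pull the numbers out of a hop line yourself, here is a minimal Python sketch. The `parse_hop` helper and its regular expression are my own illustration, not part of traceroute, and they assume the classic "hop host (ip) rtt ms rtt ms rtt ms" layout shown above; other traceroute versions print slightly different formats.

```python
import re

# Assumed layout: "<hop> <hostname> (<ip>) <rtt> ms <rtt> ms <rtt> ms"
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+(.*)$")

def parse_hop(line):
    """Return (hop_number, hostname, ip, [rtt_ms, ...]) for one hop line."""
    m = HOP_RE.match(line)
    if m is None:
        raise ValueError(f"not a hop line: {line!r}")
    hop, host, ip, rest = m.groups()
    # Every number followed by "ms" is one probe's round trip time.
    rtts = [float(x) for x in re.findall(r"([\d.]+)\s*ms", rest)]
    return int(hop), host, ip, rtts

lines = [
    "11 f8-0.inr-100-eva.Berkeley.EDU (128.32.235.100) 11.978 ms 12.495 ms 10.85 ms",
    "12 amber.Berkeley.EDU (128.32.25.12) 13.068 ms 12.883 ms 10.088 ms",
]
hops = [parse_hop(line) for line in lines]

for hop, host, ip, rtts in hops:
    # Each RTT is measured from the source to this hop, not from the previous hop.
    print(hop, host, "min", min(rtts), "max", max(rtts))
```

Note that the best RTT to hop 11 (10.85 ms) is slightly higher than the best RTT to hop 12 (10.088 ms), which is exactly the kind of small fluctuation described above.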
Now an important thing to know when using traceroute is what the asterisks/stars mean. If you see traceroute print out a star instead of a round trip time, that means that either your probe packet got dropped, or the reply back to you for that probe got lost along the way. This is usually referred to as "packet loss," and we will discuss this later.
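As a rough illustration, the fraction of lost probes at one hop can be counted straight from a hop line. This is only a sketch: the `loss_fraction` helper is hypothetical, and it assumes each probe shows up either as a lone `*` or as a number followed by `ms`. The sample lines are taken from traceroutes shown later in this article.

```python
def loss_fraction(hop_line):
    """Fraction of probes at this hop that printed '*' instead of an RTT."""
    tokens = hop_line.split()
    stars = tokens.count("*")      # probes with no reply
    replies = tokens.count("ms")   # probes that reported a round trip time
    total = stars + replies
    return stars / total if total else 0.0

clean = "3 SF-rt2-f0.geo.net (166.90.5.7) 1.19 ms 1.94 ms 1.13 ms"
lossy = "7 core1.Denver.other-isp.net (254.70.40.229) 73.1 ms * 97.4 ms"
silent = "4 * * *"

for line in (clean, lossy, silent):
    print(f"{loss_fraction(line):.0%} loss: {line}")
```

With only three probes per hop, a single star already reads as 33% loss, so treat these numbers as a hint and not a measurement.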
The first caveat to be aware of is that sometimes it will look like the last hop on a traceroute dropped a packet when it really didn't. This is due both to the fact that this host is the actual final destination of your traceroute probes and to how certain operating systems handle ICMP. (ICMP, Internet Control Message Protocol, is one protocol that machines on the Internet use to send messages to each other, and the "Your packet died here" message that traceroute relies on is an ICMP message.) Since the last hop is your destination, instead of sending you back an ICMP message saying "Sorry, your packet died here," that host will send back a different ICMP message saying "Hi, your packet made it here, but this port is unreachable." This is because traceroute purposefully sets the probe packet's destination port to some large port number that will most likely be unreachable at the destination host, because it wants to receive that "port unreachable" message back. The caveat here is that some OSes, such as IOS (which Cisco routers run) and Sun Solaris, purposefully drop ICMP responses like "port unreachable" if they have to send too many of them in a short period of time. They do this presumably as a security precaution. So, if you were to add more delay between probes, you wouldn't see this erroneous packet loss.
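To make the two kinds of reply concrete, here is a hedged Python sketch of a single traceroute-style probe. The ICMP numbers are the standard ones (type 11 is "time exceeded", type 3 code 3 is "port unreachable"), and 33434 is the customary traceroute base port. The function names `classify_reply` and `probe` are my own, the raw socket needs root privileges, the header parsing assumes Linux-style raw sockets that hand back the IP header, and the sketch does not check that a reply actually belongs to our probe. Real traceroute is more careful on all of these points.

```python
import socket
import time

ICMP_TIME_EXCEEDED = 11        # "your packet died here" (sent by a router on the path)
ICMP_DEST_UNREACHABLE = 3
ICMP_CODE_PORT_UNREACHABLE = 3  # "made it here, but this port is unreachable"

def classify_reply(icmp_type, icmp_code):
    """Map an ICMP type/code pair to what it means for a traceroute probe."""
    if icmp_type == ICMP_TIME_EXCEEDED:
        return "hop"
    if icmp_type == ICMP_DEST_UNREACHABLE and icmp_code == ICMP_CODE_PORT_UNREACHABLE:
        return "destination"
    return "other"

def probe(dest_ip, ttl, port=33434, timeout=2.0):
    """Send one UDP probe with the given TTL.

    Returns (responder_ip, rtt_ms, kind), or None if nothing came back
    (what traceroute prints as a star). Requires root for the raw socket.
    """
    recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        recv.settimeout(timeout)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        start = time.monotonic()
        send.sendto(b"", (dest_ip, port))
        try:
            packet, (addr, _) = recv.recvfrom(512)
        except socket.timeout:
            return None
        rtt_ms = (time.monotonic() - start) * 1000.0
        ip_header_len = (packet[0] & 0x0F) * 4  # skip the IP header to reach ICMP
        icmp_type = packet[ip_header_len]
        icmp_code = packet[ip_header_len + 1]
        return addr, rtt_ms, classify_reply(icmp_type, icmp_code)
    finally:
        recv.close()
        send.close()
```

Calling `probe` with TTL 1, 2, 3 and so on until the kind comes back as "destination" is the basic loop behind traceroute.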
Another caveat of traceroute is that ICMP, which is the protocol traceroute relies on to get responses from each hop, is usually the lowest priority protocol. So if one router is really busy it might decide to drop ICMP messages, and you will see lots of packet loss, but that router might be forwarding on more common, higher priority traffic just fine.
Also, some sites filter ICMP for various reasons, so a site may appear unreachable in a traceroute when in fact it is reachable.
traceroute to 209.0.0.210 (209.0.0.210): 1-30 hops, 38 byte packets
 1 SF-rt5-fe9-0.geo.net (166.90.6.1) 0.48 ms 0.440 ms 0.378 ms
 2 SF-core1-h1.geo.net (166.90.1.17) 0.618 ms 0.571 ms 0.521 ms
 3 SF-rt2-f0.geo.net (166.90.5.7) 1.19 ms 1.94 ms 1.13 ms
 4 * * *
 5 * * *
Just remember that such a traceroute can also be an example of a firewall that is filtering packets, or a router that throws away the kinds of packets that traceroute depends on when it gets overloaded.
For instance, host A might be on the west coast using ISP X, and host B might be on the east coast using ISP Y. The path from host A to host B will then probably exit ISP X as soon as it can, most likely at some peering point on the west coast and enter ISP Y's network from there onto host B. Conversely, the path from host B to host A will most likely exit ISP Y's network as soon as it can on the east coast, and enter ISP X's network and continue on to host A.
Here's an example:
traceroute to web-proxy.geo.net (166.90.90.163)
 1 E40-RTR-E40-SERVER72-ETHER.MIT.EDU (18.72.0.1) 4 ms 4 ms 4 ms
 2 EXTERNAL-RTR-FDDI.MIT.EDU (18.168.0.12) 4 ms 4 ms 4 ms
 3 cambridge2-br2.bbnplanet.net (192.233.33.6) 4 ms 4 ms 4 ms
 4 cambridge1-br1.bbnplanet.net (4.0.2.25) 4 ms 78 ms 105 ms
 5 nyc1-br2.bbnplanet.net (4.0.2.85) 12 ms 12 ms 12 ms
 6 nynap.bbnplanet.net (4.0.1.26) 12 ms 12 ms 16 ms
 7 sprint-nap.geo.net (192.157.69.43) 94 ms 82 ms 74 ms
 8 SF-rt5-a1.geo.net (166.90.4.33) 70 ms 78 ms 74 ms
 9 SF-core1-h1.geo.net (166.90.1.17) 82 ms 78 ms 74 ms
10 SF-rt2-f0.geo.net (166.90.5.7) 273 ms 234 ms 98 ms
11 web-proxy.geo.net (166.90.90.163) 133 ms 90 ms 82 ms

traceroute to BIG-SCREW.MIT.EDU (18.72.0.176), 30 hops max, 40 byte packets
 1 SF-rt2-f2.geo.net (166.90.2.13) 1.218 ms 1.219 ms 1.479 ms
 2 SF-core1-f0.geo.net (166.90.5.4) 0.704 ms 0.68 ms 0.678 ms
 3 MAE-West-h0.geo.net (166.90.1.34) 3.926 ms 3.402 ms 4.285 ms
 4 sanjose1-br1.bbnplanet.net (198.32.184.19) 5.071 ms 4.839 ms 6.973 ms
 5 su-bfr.bbnplanet.net (4.0.1.10) 6.695 ms 6 ms 8.342 ms
 6 chicago1-br2.bbnplanet.net (4.0.3.165) 71.597 ms 70.278 ms 70.166 ms
 7 boston1-br1.bbnplanet.net (4.0.2.245) 76.612 ms 74.881 ms 75.66 ms
 8 boston1-br2.bbnplanet.net (4.0.2.250) 74.099 ms 77.012 ms 76.715 ms
 9 cambridge2-br1.bbnplanet.net (4.0.1.186) 75.399 ms 75.376 ms 74.932 ms
10 ihtfp.mit.edu (192.233.33.3) 78.895 ms 76.066 ms 76.434 ms
11 E40-RTR-FDDI.MIT.EDU (18.168.0.11) 77.556 ms 76.115 ms 75.627 ms
12 BIG-SCREW.MIT.EDU (18.72.0.176) 76.484 ms 76.226 ms 77.748 ms
Note the vastly different paths that these two traceroutes take from host A to host B and from host B to host A, each with a different number of hops. The first traceroute shows the path from MIT to geo.net goes through Sprint Nap, an exchange point in New Jersey. This makes sense, since MIT is on the east coast and BBN is using closest exit routing. The second traceroute shows that the path from geo.net in San Francisco back to MIT goes through MAE West, an exchange point in the San Francisco Bay Area, the closest exit point for geo.net.
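One way to see the closest-exit behavior in the two traceroutes above is to look for the hops where the path crosses from one network into the next. The sketch below does that by grouping hostnames on their last two labels, which is only a heuristic I am assuming for illustration (reverse DNS names are not a reliable indicator of who operates a router). The `handoffs` helper is hypothetical.

```python
def network_of(hostname):
    """Heuristic: treat the last two DNS labels as the operating network."""
    return ".".join(hostname.lower().split(".")[-2:])

def handoffs(hostnames):
    """Return (last_hop_in_old_network, first_hop_in_new_network) pairs."""
    pairs = []
    for prev, cur in zip(hostnames, hostnames[1:]):
        if network_of(prev) != network_of(cur):
            pairs.append((prev, cur))
    return pairs

mit_to_geonet = [
    "E40-RTR-E40-SERVER72-ETHER.MIT.EDU", "EXTERNAL-RTR-FDDI.MIT.EDU",
    "cambridge2-br2.bbnplanet.net", "cambridge1-br1.bbnplanet.net",
    "nyc1-br2.bbnplanet.net", "nynap.bbnplanet.net",
    "sprint-nap.geo.net", "SF-rt5-a1.geo.net", "SF-core1-h1.geo.net",
    "SF-rt2-f0.geo.net", "web-proxy.geo.net",
]
geonet_to_mit = [
    "SF-rt2-f2.geo.net", "SF-core1-f0.geo.net", "MAE-West-h0.geo.net",
    "sanjose1-br1.bbnplanet.net", "su-bfr.bbnplanet.net",
    "chicago1-br2.bbnplanet.net", "boston1-br1.bbnplanet.net",
    "boston1-br2.bbnplanet.net", "cambridge2-br1.bbnplanet.net",
    "ihtfp.mit.edu", "E40-RTR-FDDI.MIT.EDU", "BIG-SCREW.MIT.EDU",
]

for name, path in (("MIT -> geo.net", mit_to_geonet), ("geo.net -> MIT", geonet_to_mit)):
    for left, right in handoffs(path):
        print(f"{name}: {left} hands off to {right}")
```

The BBN/geo.net handoff lands at the New York NAP hop in one direction and at MAE West in the other, even though the same two networks carry the traffic both ways.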
Now, to make the issue more confusing, the second reason why tracking down network "slowness" is tricky is the fact that in networking there is no "slow" or "fast", but instead there are bandwidth and latency, which are two different concepts that can both determine how "fast" a network is. (If you are unclear on the difference between bandwidth and latency, check out a cool paper written by Stuart Cheshire called "It's the Latency, Stupid". It's a little technical, but don't be scared off by that, because it's good reading.) Or, for the Cliffs Notes version, read my paper on Bandwidth vs. Latency.
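For a quick feel of how the two interact, here is a deliberately simplified Python sketch: it models a transfer as one round trip plus the time to push the bits through the link. The model ignores TCP windowing, protocol overhead, and congestion, and the figures used (a T1 at 1.544 Mbit/s, a 75 ms round trip, the two payload sizes) are illustrative assumptions, not measurements.

```python
def transfer_time_s(size_bytes, bandwidth_bps, rtt_s):
    """Very rough model: one round trip of latency plus serialization time."""
    return rtt_s + (size_bytes * 8) / bandwidth_bps

T1_BPS = 1_544_000   # nominal T1 line rate in bits per second
RTT_S = 0.075        # assumed coast-to-coast round trip time

big_file = transfer_time_s(1_000_000, T1_BPS, RTT_S)  # 1 MB file
keystroke = transfer_time_s(100, T1_BPS, RTT_S)       # tiny interactive packet

print(f"1 MB file:    {big_file:.3f} s (dominated by bandwidth)")
print(f"100 B packet: {keystroke:.4f} s (dominated by latency)")
```

A fatter pipe would shrink the first number a lot and the second one hardly at all, which is why a remote login can feel slow on a link that moves files quickly.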
traceroute to www.uu.net (199.170.0.30): 1-30 hops, 38 byte packets
 1 SF-rt2-f2.geo.net (166.90.2.13) [AS3356 - GeoNet Communications, Inc.] 1.56/1.94/3.34 (0.418) ms 20/20 pkts (0% loss)
 2 SF-core1-f0.geo.net (166.90.5.4) [AS3356 - GeoNet Communications, Inc.] 1.27/2.4/10.3 (1.92) ms 20/20 pkts (0% loss)
 3 MAE-West-h0.geo.net (166.90.1.34) [AS3356 - GeoNet Communications, Inc.] 3.99/5.12/10.2 (1.29) ms 20/20 pkts (0% loss)
 4 198.32.136.42 (198.32.136.42) [AS701 - AlterNet route - AS 701] 4.75/6.96/12.4 (2.7) ms 20/20 pkts (0% loss)
 5 118.ATM11-0-0.XR1.SJC1.ALTER.NET (146.188.144.138) [AS702 - UUNET-NET] 5.2/6.35/9.95 (1.9) ms 20/20 pkts (0% loss)
 6 193.ATM2-0-0.XR1.SCL1.ALTER.NET (146.188.144.145) [AS702 - UUNET-NET] 7.49/14.7/68.8 (13.1) ms 20/20 pkts (0% loss)
 7 100.ATM2-0-0.TR1.SCL1.ALTER.NET (146.188.145.226) [AS702 - UUNET-NET] 8.47/9.95/16.7 (1.78) ms 20/20 pkts (0% loss)
 8 107.ATM8-0-0.TR1.DCA1.ALTER.NET (137.39.104.2) [AS701 - UUNET] 69.4/72.5/81.2 (3.0) ms 20/20 pkts (0% loss)
 9 100.ATM5-0-0.XR1.DCA1.ALTER.NET (146.188.161.53) [AS702 - UUNET-NET] 71.1/86.7/280 (45.8) ms 20/20 pkts (0% loss)
10 195.ATM8-0-0.XR1.TCO1.ALTER.NET (146.188.160.106) [AS702 - UUNET-NET] 70.9/73.8/82.7 (2.58) ms 20/20 pkts (0% loss)
11 193.ATM5-0-0.GW2.FFX1.ALTER.NET (146.188.160.209) [AS702 - UUNET-NET] 69.1/73.9/77.9 (2.71) ms 20/20 pkts (0% loss)
12 UUNET7-GW.UU.NET (137.39.12.162) [AS701 - UUNET] 72.9/75.3/87.0 (3.4) ms 20/20 pkts (0% loss)
13 www.uu.net (199.170.0.30) [AS701 - UUNET Technologies, Inc.] 71.3/74.2/78.6 (1.83) ms 20/20 pkts (0% loss)
So this route looks pretty good. Let's try to debug a bad one. To avoid making any specific ISP look bad, some hostnames and IP addresses have been changed to protect the innocent. Let's say you're connected to GeoNet via a T1, and you have another office in Chicago that is connected via a different ISP.
One day you notice some definite slowness in transferring files and/or logging into machines at the remote site and you want to see where the problem lies. So you decide to do some traceroutes. A traceroute from your GeoNet connected office shows you:
traceroute to chicago4.mycompany.com (342.5.133.4): 1-30 hops, 38 byte packets
 1 router.SanFrancisco.mycompany.com (209.0.0.278) 3.52 ms 2.75 ms 2.63 ms
 2 some_interconnect.geo.net (166.90.420.231) 71.5 ms 3.71 ms 3.5 ms
 3 SF-core1-h1.geo.net (166.90.1.17) 3.23 ms 3.20 ms 3.25 ms
 4 MAE-West-h0.geo.net (166.90.1.34) 7.30 ms 13.7 ms 6.33 ms
 5 mae-west.other-isp.net (198.32.136.256) 21.0 ms 31.4 ms 29.7 ms
 6 core2.SanFrancisco.other-isp.net (254.70.100.245) 21.44 ms 32.2 ms 32.5 ms
 7 core1.Denver.other-isp.net (254.70.40.229) 73.1 ms * 97.4 ms
 8 border3.Chicaco.other-isp.net (254.70.56.23) 62.3 ms 86.23 ms 53.88 ms
 9 my-company-t1.Chicago.other-isp.net (254.70.111.34) 120.43 ms 95.3 ms 86.44 ms
10 router.Chicago.other-isp.net (342.5.133.1) * * 112.42 ms
11 chicago4.mycompany.com (342.5.133.4) 132.34 ms * 104.12 ms
Looking at this traceroute, you can see that there is some packet loss, but it's hard to tell exactly where it starts. It could be the link between hops 6 and 7, where the first star shows up, but it's hard to know for sure. So, being an educated tracerouter, you decide to do a traceroute from Chicago back to your office in San Francisco. You get:
traceroute to sf13.mycompany.com (209.0.0.267): 1-30 hops, 38 byte packets
 1 router.Chicago.other-isp.net (342.5.133.1) 3.85 ms 2.64 ms 4.15 ms
 2 my-company-t1.Chicago.other-isp.net (254.70.111.33) 5.16 ms 3.94 ms 7.22 ms
 3 border3.Chicaco.other-isp.net (254.70.56.23) 3.62 ms 4.28 ms 5.15 ms
 4 core1.Denver.other-isp.net (254.70.40.229) 25.8 ms 27.2 ms 23.7 ms
 5 core2.SanFrancisco.other-isp.net (254.70.100.245) 141.0 ms * 49.7 ms
 6 pb-nap.geo.net (198.32.128.24) 123.43 ms * 76.22 ms
 7 SF-rt3-f0.geo.net (166.90.354.7) * 94.12 ms 102.32 ms
 8 some_interconnect.geo.net (166.90.420.232) 85.24 ms * 97.3 ms
 9 sf13.mycompany.com (209.0.0.267) 117.31 ms 234.42 ms 99.19 ms
So now you have more to go on. First of all, you see that this route is an asymmetric one. The first route is 11 hops and the route back is 9 hops. Now, the number of hops doesn't make any significant difference in how fast your connection is, but asymmetry can make things like packet loss and latency increases appear to occur between two hops when the problem really isn't there. This is because the packet loss or increase in latency might be between two hops you don't even see, because the route back to you is completely different.
So now you can make an educated guess as to where the packet loss might be occurring. Based on the first traceroute, it looked like the bad link might be between core2.SanFrancisco.other-isp.net and core1.Denver.other-isp.net, and by looking at the route back in the other direction, it appears that this assumption might be correct. At this point, your best bet is to copy and paste your traceroutes and send them to your ISP. Armed with this kind of information, your ISP will have a much better chance of tracking down the problem than if you just sent them an email saying "my connection to my Chicago office is slow." It also gives you a better understanding of how traffic is exchanged on the Internet.
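The reasoning above can be sketched in a few lines of Python: find the first hop that shows a star in each direction and see whether both traces point at the same pair of routers. The `first_lossy_link` helper is hypothetical, the data is hand-copied from the relevant hops of the two example traceroutes, and a first star is only a hint, for all the reasons covered in the caveats earlier.

```python
def first_lossy_link(hops):
    """hops: list of (hostname, [probe results]) in path order.

    Returns (previous_hop, lossy_hop) for the first hop showing a '*',
    or None if no probe was lost.
    """
    for i, (host, probes) in enumerate(hops):
        if "*" in probes:
            previous = hops[i - 1][0] if i > 0 else None
            return previous, host
    return None

# Hops 5-8 of the San Francisco -> Chicago trace.
sf_to_chicago = [
    ("mae-west.other-isp.net", ["21.0", "31.4", "29.7"]),
    ("core2.SanFrancisco.other-isp.net", ["21.44", "32.2", "32.5"]),
    ("core1.Denver.other-isp.net", ["73.1", "*", "97.4"]),
    ("border3.Chicaco.other-isp.net", ["62.3", "86.23", "53.88"]),
]
# Hops 3-6 of the Chicago -> San Francisco trace.
chicago_to_sf = [
    ("border3.Chicaco.other-isp.net", ["3.62", "4.28", "5.15"]),
    ("core1.Denver.other-isp.net", ["25.8", "27.2", "23.7"]),
    ("core2.SanFrancisco.other-isp.net", ["141.0", "*", "49.7"]),
    ("pb-nap.geo.net", ["123.43", "*", "76.22"]),
]

forward = first_lossy_link(sf_to_chicago)
reverse = first_lossy_link(chicago_to_sf)
print("forward suspect link:", forward)
print("reverse suspect link:", reverse)
# Same two routers in both directions suggests the link between them.
print("same link both ways:", set(forward) == set(reverse))
```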
All Content Copyright (c)2000 by Greg Gardner greg@bah.org