Archive for the ‘Uncategorized’ category

Troubleshooting networking

2008-08-27-Wed

a.k.a. I F***ING HATE COMPUTERS WHEN THEY DON’T WORK RIGHT AND I CAN’T FIX THEM

Introduction

Up until now, I’ve almost always been running the DrProject server and the web browser to drive it on the same machine. For reasons I will not go into, I needed to run DrProject on an Ubuntu Linux installation on one computer and run the web browser on Windows on another computer. These two computers were on the same LAN. Let’s call these computers Lin and Win, respectively. Please don’t mind me personifying these machines.

When Firefox on Lin talked to the DrProject being hosted locally, everything worked fine. When Firefox on Win talked to the DrProject server on Lin, each page request took an excruciatingly long time to complete — something in the order of tens of seconds, rather than a second or two.

Coming up is a chronicle of following numerous false leads, running up and down the stairs, and diving into the depths of computing to debug problems at a low level. It unfolded over the course of about 3 hours, and is told in approximately chronological order.

Sanity checks

First, I tried accessing DrProject using Internet Explorer (on Win) to made sure that it wasn’t Firefox’s fault. Indeed, it wasn’t. Second, I tried accessing the server using a different Windows computer. Still no luck. I even tried using telnet[0] to access the HTTP server, where I still received a much-delayed response.

The puzzling thing was that Lin had no delays in accessing Internet resources, nor LAN resources (e.g. SMB into the Windows boxes).

In retrospect, if Windows was the problem, then I wouldn’t have caught it because I didn’t have another Linux computer on hand.[1]

Networking

Maybe the network was to blame. All the computers in question were hooked up to a home router (“residential gateway”), and maybe it was causing problems. So I hooked up Lin and Win directly (on a single cable), and set up a static IP configuration for each. Even after back-and-forth tweaking and checking, this didn’t work. So I produced my old 10 Mb/s Ethernet hub, and connected these two computers to the hub. After more IP configuration tweaking, I finally got these computers to talk to each other again.

Mind you, all of this was complicated by the fact that Lin and Win were located one floor apart (yay for running up and down the stairs). Also, Lin had 2 network interfaces[2], and it was difficult to tell which one was mapped to device eth1 and which was mapped to eth2. I did try enabling and disabling each of these two interfaces.

… And there was no improvement in the Win-browser-talking-to-Lin-server lag.

Enter Wireshark (formerly Ethereal)

By now, I had dragged my friend Roy into the problem, as he is a Linux expert and a competent programmer. Other than mildly disagreeing about my choice of screen capture software, he suggested right off the bat that I analyze the network traffic with Wireshark, a packet sniffer. Since I used the older Ethereal back on Windows, I promptly set off to install Wireshark through the Debian package system without hesitation.

In the meantime, I tested using Win to connect a server socket on Lin to see if that was a problem. Using telnet[3] on Win and a netcat listener[4] on Lin, I saw no delay at all. That was weird. More and more, it looked like “tracd”, the built-in HTTP server in DrProject (inherited from Trac), was where I should have pointed my blame.

When Wireshark was downloaded and installed, I fired it up and captured the relevant network interface. I changed the DrProject server port from 8080 to the standard 80, and used the capture filter “tcp port http” to reduce the amount of data captured.

What immediately caught my eye was that while most rows were coloured green, quite a number of them were coloured black. Wireshark explained that the black rows represent packets whose TCP checksums were incorrect, giving the reason ‘maybe caused by “TCP checksum offload”?’.

Digging a bit deeper, these “black” packets were all from Lin to Win (but not all packets from Lin to Win were black), and they all had a size of 1514 bytes. But wait a minute! Isn’t it true that the Ethernet MTU is 1500 bytes? Wouldn’t it be bad to exceed the MTU? So I consulted Wikipedia and RFC 1191, but to no avail. They confirmed that the Ethernet standard MTU is 1500 bytes, but didn’t mention whether that included Ethernet headers or not.

Through Wireshark’s detailed analysis of the raw 1514 bytes of data, I concluded that the 14 bytes were for Ethernet headers — 6 bytes for the MAC address of the sending machine, 6 bytes for the MAC address of the receiving machine, and 2 bytes for the EtherType[5]. With a bit of ifconfig wizardry[6], I set the MTU to 1486 bytes so that after adding the Ethernet headers, the total comes out to 1500 bytes. But I still had no luck with the slow connection problem!

Following the TCP conversations through Wireshark did not prove of much use either. But I did see that it took about 10 seconds to respond to each HTTP connection.

strace

Besides Wireshark, the other major recommendation from Roy was to run the DrProject server under strace. This program invokes another program, and traces (prints) all of the Unix system calls that the child program makes.

Roy asked me to inform him of the last line printed by strace whenever the output stopped momentarily. I first noted that the trace stopped at “wait4(-1,” after starting the server. He said that it was because the target process was launching a child process.

I realized that to make this trace work, I had to disable DrProject’s auto-reloader. Which actually required me to hack the code, because it seemed the “–auto-reload” command line option was not parsed and handled correctly. In addition to that, the Python BaseHTTPServer was multi-threaded, so I had to invoke strace with the -f option, which resulted in significantly more trace output to the screen.

Finally, something interesting appeared. When Win talks to Lin, the trace stops at “read(4,” for a number of seconds, then resumes. Roy said that 4 is the file descriptor, and asked me to trace all the actions starting from the point where that particular FD (#4) was opened. This is what I found, with some irrelevant information removed:

socket(PF_FILE, SOCK_STREAM, 0) = 4
fcntrl(4, F_GETFD)
connect(4, {sa_family=AF_FILE, path="/var/run/avahi-daemon/socket"}, 110) = 0
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
write(4, "RESOLVE-ADDRESS 192.168.0.108\n", 30) = 30
read(4,  <-- Hangs here for a few seconds

If you ask me, that data being passed to the write system call looked like a DNS thing, or perhaps an ARP thing.

Avahi

With the evidence provided by strace in hand, Roy pointed to the Avahi Zeroconf software as the source of the problem. He asked me to toggle it — that is, start it if it was not running, and stop it if it was already running.

Stopping Avahi[7] and trying to access DrProject server on Lin from Win’s Firefox again… Success! W00t! 3 hours of intense troubleshooting had finally come to an end.

Conclusion

Was there a point in writing this? I don’t know. I don’t think you, the reader, can learn much from it. This article is comparable to a rant, with a hodge-podge of disconnected personal events and ideas. An honest and objective rant, though.

But I think this paints a picture of what computer problems look like, and how they are exhaustively troubleshooted and (hopefully) finally solved.

Oh yeah, and I took up about 1.5 hours of Roy’s time while he was at work. He has a flexible schedule as a software developer at a big company, so no worries. =P

Footnotes

Lin had the IP address 192.168.0.101.

Win had the IP address 192.168.0.108.

After Lin and Win could talk to each other, I would sit at Lin’s console and operate Lin while also hacking into Win using Remote Desktop Connection. So I didn’t run up and down between computers that much.

[0]: telnet 192.168.0.101 8000. Type “GET / HTTP/1.1 \r\n Host: 192.168.0.101 \r\n \r\n”.

[1]: Then again, I could have booted Ubuntu from a live CD.

[2]: One integrated on the motherboard, and one on a PCI card.

[3]: telnet 192.168.0.101 8000

[4]: netcat -l -p 8000

[5]: In this case, it was 0x0800 for IPv4.

[6]: ifconfig eth2 mtu 1486

[7]: sudo /etc/init.d/avahi-daemon stop

Advertisements

Simplified admin menu and enhanced users/projects pages

2008-08-08-Fri

As per Liz’s suggestions (#0, #1, #2), I have simplifed the admin menu, and added statuses and actions to the list users page and the list projects pages. (Implemented in changeset 5663.) Screenshots:

The reduced admin menu:

Manage Users page:

Manage Projects page:

Manage Roles and Capabilities page:

Easy examples of redundancy

2008-07-29-Tue

The following are some examples of simple redundancies in procedural programming. They are given in Python. I typically run into these when I simplify my code after writing it for the first time or after making edits.

Using if-else to assign a Boolean:

if condition:
    x = True
else:
    x = False

Simplified:

x = condition

(Requires that condition is a Boolean.)

Returning immediately after an assignment:

y = value
return y

Simplified:

return value

Common code among branches:

if condition:
    group of statements A
    group of statements B
    group of statements D
else:
    group of statements A
    group of statements C
    group of statements D

Simplified:

group of statements A
if condition:
    group of statements B
else:
    group of statements C
group of statements D

Nested if-statements:

if condition0:
    if condition1:
        statements

Simplified:

if condition0 and condition1:
    statements

Requires that the outer if-statement does not have an else-clause.

If-statements with the same actions:

if condition0:
    statements (same)
if condition1:
    statements (same)

Simplified:

if condition0 or condition1:
    statements (same)

Requires condition0 and condition1 to be disjoint, or statements to be idempotent.

Department of redundancy department: Python lambdas

2008-07-29-Tue

A coworker recently pointed out the lambda (anonymous function) in this piece of code that I wrote was redundant:

project_key_regex = re.compile(r"^project(\d+)$")
for key in filter(lambda k: project_key_regex.match(k), req.args.keys()):
    ...

It would be improved like this:

project_key_regex = re.compile(r"^project(\d+)$")
for key in filter(project_key_regex.match, req.args.keys()):
    ...

Apparently, methods of an object can be thought of as functions that have one variable bound to the object.

A bit of info about the phrase “Department of Redundancy Department” on Everything2: http://everything2.com/title/Department%2520of%2520Redundancy%2520Department

The revised create users and create projects doublet

2008-07-28-Mon

A picture is worth a few hundred thousand bytes, so without further ado, this is what’s new:

Although it isn’t apparent from the screenshot, a new memberships row (with “select a user/project” and “select a role”) is added whenever the final row is filled.

Compare these to the interfaces that we had before:

Create Multiple Users: Error handling

2008-07-23-Wed

The new Create Multiple Users page has a back-end that gives information on all sorts of errors:

Great bug writers think alike

2008-07-10-Thu

a.k.a. “A subtle bug from a perfect storm of HTTP, HTML, JavaScript, and DOM

My current work

I’m currently in the midst of implementing a new interface for creating new users. This is how it currently looks like:

The project membership table is created on the client side using JavaScript. The administrator adds and removes memberships (rows) without submitting the form.

HTML forms and HTTP, revisited

(Recall A simplified overview of HTML forms.) When the form is submitted over HTTP, all that is sent is a dictionary of key-value pairs. Each form element (text box, checkbox, select, etc.) generates at most one item for the dictionary, where the key is the name of the element (given by the name attribute) and the value is the string value of the element (“alice”, “pA55w0rd”, “on”, “viewer”, etc.).

Perhaps surprisingly, the dictionary’s keys need not be unique. For example, if there are two text boxes named “foo”, one with the value “bar” and one with the value “baz”, then the submitted form data would be “foo=bar, foo=baz” (or possibly “foo=baz, foo=bar”).

The tricky stuff: Client side

So for the form pictured above, the submitted information will be something along the lines of “username=roule, email=roulette@localhost, realname=Roulette, affiliation=Pani, password=01234567″… But wait! What about the membership table?

The first (but incorrect) idea that comes to mind is to give the same name for all elements in a column. For example, all the project selects would be named “project”, and all the role selects would be named “role”. The problem with this is that in the presence of multiple “project” and “role” keys, we don’t know which one was assoicated with which. A concrete example: Suppose John is a viewer in Alpha and a developer in Beta. The submitted form data would look like “project=Alpha, project=Beta, role=viewer, role=developer”, but with the entries being presented to the server in an arbitrary order. So it would be impossible to associate projects and roles uniquely.

Clearly then, the solution is to give each field a unique name. A simple method is to add a counter after the name – like “project0”, “project1”, “role0”, “role1”, etc.

The tricky stuff: Server side

How would the server process this information when it’s submitted? Assume that the client does not send the count of the number of rows. One way for the server to handle the variable number of rows is to start from a prearranged fixed number – 0 in this example – and process form fields and count upwards until the first n such that the submitted field “projectn” does not exist.

So far so good, right? But notice that on the client side, membership rows can be removed. Suppose that for simplicity, we don’t re-enumerate the remaining form fields. So if we had “project0”, “project1”, “project2” and then removed the middle entry, then we’d be left with “project0” and “project2”. Clearly, this interacts badly with the server side handler just described, since “project2” will be ignored. (Because “project1” does not exist, the server stops processing.)

Solutions

There are two relatively simple solutions to this problem:

  • Keep the client-side behaviour and modify the server-side behaviour. For each new row, give an auto-incrementing numerical suffix for the form fields. Do nothing special when removing rows. On the server, iterate through all form fields that have a name of the form “projectN“, where N is a number string. Ignore “gaps” in the numbering.
  • Modify the client-side behaviour and keep the server-side behaviour. When a row is removed, re-enumerate the rows that follow it. This is the solution that I picked, for no particular strong reason.

Great bug writers think alike

Now, if this problem were merely hypothetical, I wouldn’t have a good reason to write this rant-like-lengthed post about it. But it does already exist in DrProject! Namely, it manifests itself in the “Membership Preferences” section of the Edit User page and also on the current user’s Preferences page.

The offending server-side form handler code is found in register.py. (The complicated process of pinpointing this piece of code will be another story.)