Recitation 3

More Debugging

Debugging is like being the detective in a crime movie where you are also the murderer.

Filipe Fortes, technical lead at Flipboard

Debugging, while frustrating, is absolutely essential. Figuring out what your crazed, sleep-deprived self of yesterday meant to do and what your code is actually doing is no simple task. So we’re discussing some more debugging tips before you start building this search engine.

Tip 0: Read the GDB Notes

I mean Professor Campbell's this time, particularly the part about patching code while inside GDB. This is super awesome. Learn it. Not having to quit GDB to make every little change to your code is fantastic, and it will save you lots of time in the long run.
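
To give you a taste, here's roughly what a patching session looks like. The function and variable names below are entirely made up - the point is that you can inspect a value at a breakpoint, overwrite it with set variable, and keep running, all without leaving GDB:

(gdb) break compute_total
(gdb) run
(gdb) print count
$1 = 0
(gdb) set variable count = 10
(gdb) continue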

Step through the examples in both those notes and the previous recitation yourself, and see if you can really employ these practices. I urged you to do this in lab as well, and I hope you followed my advice. Moving on.

Tip 1: Small Functions

Why do we split things up into multiple functions? Why not have one giant main function? It's to save the programmer's time. Anytime you see the same code appearing in multiple places, you should factor it out into a single function. And C, unlike many other languages, has very small function call overhead, so you'll find this won't hurt your performance much.

But besides all of the usual bull-**** you hear, why do you really write many functions? Let me be devil’s advocate for a moment. I don’t care that I have to type out lots of things. I have the power of Vim and regular expressions! I can copy-paste content onto fifty files with a single Unix command! Hell, I won’t even write loops. I’ll literally write the same loop body fifty times and manually decrement i for all I care. Come on, give me a better reason.

Well, there are really two main reasons to write small functions besides the whole saving-programmer-time / code-design thing.

  1. Readability. Yup, small functions are the best way to improve readability, and I don't care what someone else told you. Better than comments, better than good variable names. Reading a function that's ten lines long just makes life easy. It also allows further abstraction: once I understand each of the tiny functions, putting them together is simple. This is super helpful for anyone who will read your code later on - i.e., yourself at a later point in time.
  2. Testing. Testing. Testing. How do you debug your code? So far in your computer science career, more than likely you’ve simply been running your code and hoping it works. Do you have a rigorous testing protocol? What is it? Enter the world of unit tests.

Tip 2: Code, Don’t Write - Test Driven Development

This really is a part of the previous point, but it’s large enough that I decided to allocate a separate bullet point for it.

There are lots and lots of different kinds of tests. There are unit tests, integration tests, smoke tests, regression tests, acceptance tests, build tests, and many many more. People dedicate their whole lives to testing code - yes, it is that important.

But since this is your first introduction to formal testing, we’ll start with unit tests. The purpose of a unit test is to test a very narrow segment of code - in Java, that would mean a single method of a class; in C, a single function.

Designing tests in a rigorous manner to really make sure your functions work is super important. In fact, there is a whole group of people who write their tests first. By writing the tests first, you plan out the entire design of the program down to what functions will do what. It allows you to think abstractly about the problem and focus on high-level design choices. Then, you go ahead and write the individual functions. It’s just like writing an outline and then filling it in to create an essay. This approach is called Test Driven Development. It’s all the rage these days in industry.

But we’re not in industry, so I won’t mandate you do it this way. Suffice it to say that I’m thoroughly convinced, and these days even for school projects I employ TDD. The most common concern from testing newbies goes something like this:

“Well okay, I have to in addition to writing the code I would normally, write all these tests that have to be all rigorous and thorough, yo. Talk about wasting time.”

Listen well, “yo”. This is work you’ll have to do regardless - it’s just that you’ll be doing it in pieces throughout rather than all at once at the end. And by writing the tests up front, you’re forced to think about your design and implementation - and hey, by the way, it’s been shown by many a person smarter than you or me that TDD saves time in the long run. It’s become standard industry practice.

The real reason people run into so many issues coding is that they treat it like writing an essay. When writing a paper, oftentimes you breeze through the whole thing in a rough draft, then go over it several more times in editing passes until a polished paper comes out. This is not how to approach coding at all. If you think you can write the entire assignment and just “edit” your way through it, you’re gonna have a bad time. Write a small function, make sure it works. Move on to the next small function. Rinse and repeat.

Tip 3: Unit Testing

So I hope the immensely long rant convinced you that testing your code is absolutely essential. If you’re ever going to be doing industry-standard coding (getting a job, doing an internship, even doing open-source volunteer work), submitting tests alongside your code is expected, not viewed as a bonus. So let’s look at writing some actual unit tests!

Admittedly, unit testing in C is slightly more difficult than in Python with its nose framework or in Java with the very famous JUnit. The Google Testing Framework is, in my opinion, the best one - some may argue that Check is better, but I disagree. Regardless of which one you use, unit test your code. Today’s recitation, however, will focus on the incredibly simple MinUnit.

The entire “framework” is literally just three lines of code (four as displayed, because one line wraps), and all of it is contained within minunit.h. It doesn’t get better than this. It’s my hope that by presenting the simplest framework in existence, all of you will do some sort of testing of your code.

/* file: minunit.h */
#define mu_assert(message, test) do { if (!(test)) return message; } while (0)
#define mu_run_test(test) do { char *message = test(); tests_run++; \
                                if (message) return message; } while (0)
extern int tests_run;

That’s it! That is the entire framework. So how do we use it?

Let’s go through an example. Here is my obviously buggy swap.c that I want to test:

void swap(int *a, int *b) {
    return;
}

And here is the corresponding header file:

void swap(int *a, int *b);

Let’s test it! For this extremely introductory example, let’s just do one single test case. I’ve put all my tests in test.c:

#include <stdio.h>
#include "swap.h"
#include "minunit.h"

int tests_run = 0;

int a = 5;
int b = 4;

static char *test_swap() {
    swap(&a, &b);
    mu_assert("error, a != 4", a == 4);
    mu_assert("error, b != 5", b == 5);
    return 0;
}

static char *all_tests() {
    mu_run_test(test_swap);
    return 0;
}

int main() {
    char *result = all_tests();
    if (result != 0)
        printf("%s\n", result);
    else
        printf("ALL TESTS PASSED\n");
    printf("Tests run: %d\n", tests_run);

    return result != 0;
}

A couple of things to note here. Obviously, whatever file does the actual testing will have to #include "minunit.h". Feel free to include this file in your SVN repo.

Next, you’ll notice the actual testing file has to do a bit of setup to create the data that you will test against. This is common practice, and it leads to testing files being a bit long - but luckily, the setup isn’t difficult to write.

Lastly, the all_tests function here is rather unnecessary since we only have a single test; however, if we had many tests, it would be nice to put all the calls within a single function.
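
If you want to follow along, one way to compile and run the whole thing (assuming gcc, with all three files sitting in the same directory) is:

gcc -Wall -o test test.c swap.c
./test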

Okay, let’s run it.

error, a != 4
Tests run: 1

Bummer. It doesn’t work. Well, the fix is simple enough:

#include "swap.h"

void swap(int *a, int *b) {
    int temp = *a;
    *a = *b;
    *b = temp;
}

And the new output:

ALL TESTS PASSED
Tests run: 1

If you are more interested, there is plenty of further reading out there on unit testing in C.

Tip 4: Advanced GDB

These are two things I forgot to mention in my actual GDB tutorial!

Debugging segmentation faults with GDB is super easy: use the backtrace (or bt) command if you’re running the executable within GDB. If not, don’t worry - segmentation faults usually leave behind a core dump file. Usually there will be a series of numbers (typically the process ID) after the word “core” in the filename; choose the latest one (otherwise you’ll be debugging the wrong core dump!).
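
For example, if the crashing executable were called crawler and it left behind a dump named core.12345 (both names invented here), you’d hand GDB the pair and ask for a backtrace:

gdb crawler core.12345
(gdb) bt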

The other really handy trick is attaching GDB to a process that has already started. Once you have the process ID of the process you wish to attach to, simply run gdb -p PID.

GNU Make

One last thing to talk about before we discuss the details of the lab: GNU Make.

For this lab, you’ll have many source files and many header files. Imagine you have some of the following completely made up files: dictionary.c, dictionary.h, crawler.c, crawler.h, url.c and url.h. Now, every time you compile, you have to do mygcc dictionary.c url.c crawler.c, and if you made a special debugging version, you have to do a different command. Then all those .o and .gch files created may have to be cleaned up after your test run. There should be a way to automate this process. Enter GNU Make.

GNU Make is an awesome tool for building configurable executables quickly and easily, including debuggable versions of those executables, and for removing any ‘junk’ created afterwards (think .gch, .o, and other files).

Here is an example Makefile to help you get started.

# Lines that start with "#" are comments
# Filename
# Description / Purpose
# Any specific warnings to build special files, should they apply

# Which compiler? This should be gcc
CC=gcc

# Any params you'll pass to gcc
# I have 2 sets of params - one for normal, one for a debugging version
CFLAGS=-Wall -pedantic -std=c99 -O3
DEBUG_FLAGS=$(CFLAGS) -ggdb

# What are the relevant .c and .h files?
SOURCES=./crawler.c ./crawler.h ./dictionary.c ./dictionary.h ./url.c ./url.h

# Here are the make commands
crawler: $(SOURCES)
    $(CC) $(CFLAGS) -o crawler $(SOURCES)
debug: $(SOURCES)
    $(CC) $(DEBUG_FLAGS) -o debug $(SOURCES)

clean:
    rm -f debug
    rm -f crawler
    rm -f *.o
    rm -f *.gch
    rm -f *#
    rm -f *~
    rm -f *.swp

Now, if I call make clean, it will execute all the commands listed under the clean target. If I call make debug, it will create an executable called debug that I can use GDB on. No more long, complicated gcc commands - it’s all done for you!

Paths to your source files (notice ./crawler.c instead of crawler.c) are relative to the directory containing your Makefile. So if crawler.c were inside another folder foo, the Makefile would say ./foo/crawler.c.

The most common source of Makefile bugs: the indented lines underneath each target like crawler or debug or clean must start with a tab - not spaces. Do soft tabs screw this up? Try it and find out. :) You may be surprised.

You should of course include in your README how someone else should build / use your Makefile, and any special instructions. Obviously, submit your Makefile to SVN as well.

The Crawler Lab

Okay, now that we have our toolbox filled to the brim with new knowledge and tips, let’s talk about the lab.

This is not an easy lab. I’ll say it again - this is not an easy lab.

Start early, get it working. You’ll find your future labs depend on this one being a success. Do I expect all of you to turn in a bug-free lab this time? Nope. And I’m not even going to point out all of the bugs. Your job over the next four weeks is to build a cohesive search engine. If that means your query engine breaks because of some small bug you wrote into your crawler three weeks beforehand then so be it.

You should be thinking of this as one large lab. In fact, part of your later assignment will be to come back to this lab and make it squeaky clean - no memory leaks, optimize this, etc. But we’ll talk about tools like valgrind and gprof later. For now, let’s talk about the design pattern for this assignment.

You’ve actually seen the implementation and design patterns for this assignment in class already, and most likely Professor Campbell has spoken about this at length. There’s not much discussion here - you’re all implementing the design we laid out in class, end of story.

How does a crawler work? We’re given a starting URL, often referred to as the seed. We crawl until we hit a predetermined maximum depth value. The seed is given a depth of 0 by convention. All URLs found in the seed are of depth 1, and so on.

So what data structures do we need?

These are of course just suggestions, but I think they’re very good ones that we should all use. The first is obviously some sort of ordered data structure that we have to implement; the real question is what order it should maintain. Should it be a stack? A queue? Something else?

Well, this depends on your implementation of the crawler. I would do something in the style of Breadth First Search: crawl everything with depth 0 first, then crawl everything with depth 1, and so on. This would mean a standard queue. For the actual implementation, you need something that allows you to pop off the front quickly and something that allows you to add onto the end very quickly. What this becomes is up to you. The obvious solution is a linked list. If you have a better, non-obvious solution, feel free to implement that as well.
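
To make that concrete, here’s a minimal sketch of such a linked-list queue. All the names here (urlnode, enqueue, dequeue) are mine, not the lab’s, and a real version would check malloc’s return value:

#include <stdlib.h>
#include <string.h>

/* one node per URL waiting to be crawled */
typedef struct urlnode {
    char *url;
    int depth;
    struct urlnode *next;
} urlnode;

/* track both ends so enqueue and dequeue are O(1) */
typedef struct {
    urlnode *head;
    urlnode *tail;
} queue;

/* add a URL to the back of the queue */
void enqueue(queue *q, const char *url, int depth) {
    urlnode *node = malloc(sizeof(urlnode));
    node->url = malloc(strlen(url) + 1);
    strcpy(node->url, url);
    node->depth = depth;
    node->next = NULL;
    if (q->tail)
        q->tail->next = node;
    else
        q->head = node;
    q->tail = node;
}

/* remove and return the front node (NULL if empty); the caller frees it */
urlnode *dequeue(queue *q) {
    urlnode *node = q->head;
    if (node) {
        q->head = node->next;
        if (q->head == NULL)
            q->tail = NULL;
    }
    return node;
}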

For the second data structure, we have some choices, but remember there will be many, many URLs to look up. One choice - one that optimizes for time complexity rather than memory - is a dictionary (or a HashMap, for the Java crowd).

How do dictionaries (or hash tables) work? A dictionary is a data structure that, in the worst case, takes O(n) lookup time. So why use them? Hell, a simple array takes O(n) worst case lookup time. Why not just use those? Well, most of the time, a dictionary takes constant lookup time.

Let’s imagine a dictionary as a simple array of length 1000. Now let’s say we want to insert the string "Hello world" into this dictionary so we can easily find it later. We take a hash function, apply it to the string, and it gives us a value. Good hash functions spread distinct inputs across their output range as evenly as possible - that is, they should produce as few collisions as they can. If two different strings "goodbye" and "hello" both had a hash value of 6, well, that wouldn’t be a very good hash function, would it?

The hash function we’ve given you is a pretty decent one - feel free to use a better one should you find one online (of course you will cite it…). Great! So we have some unique hash value as a result of this awesome mathy hash function. Now what?

Well, we can’t just use that value as an index automatically - remember how our dictionary is just a 1000-element array? So how do we fit this integer into our array? How about the modulus operator? By modding the output of the hash function by the size of the dictionary, we get some value between 0 and 999 inclusive - exactly the valid indices into our dictionary.

So we can use this value as our index into the dictionary. So… is that it? No! The modulus function artificially creates collisions! Think about it. Imagine we have a perfect hash function which never creates the same value given two different inputs. Great. So we have two values 1000 and 2000. Well, when we mod them both by 1000, won’t they both be assigned to 0?

Of course. So what do we do now? Well, what if we changed our data structure from a simple array of ints to an array of linked lists? Now that each element in the array is a linked list, we can assign both of those hashes to the same index and chain their entries together in that slot’s list.

But this throws another wrench into our problem. What if we wanted to find out whether the string "goodbye" was in our dictionary? Well, we’d run that string through our hash function, mod the result by the size of the dictionary, and look at that index. But we can’t just assume that if the value there is not null, it’s what we’re looking for. Remember, there can be collisions - the string "hello" could have been assigned to the same index. So we iterate through the linked list looking for the string, and only if we don’t find it there can we conclude it is not in the dictionary.

Is it becoming clear why, in the worst case, a dictionary is also O(n)? What’s the worst case? Imagine a dictionary of size 1: just a single pointer to a linked list. Every call to the dictionary would have to look through that entire linked list - in other words, it behaves just like a normal array.
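
In code, the whole hash-then-mod-then-chain dance is only a few lines. Here’s a minimal lookup sketch - MAX_SLOTS, dnode, and the toy hash function are all stand-ins for whatever your actual implementation uses:

#include <string.h>

#define MAX_SLOTS 1000

/* each slot holds a chain of every key that hashed to it */
typedef struct dnode {
    char *key;
    struct dnode *next;
} dnode;

dnode *table[MAX_SLOTS];

/* a toy hash for illustration; the lab hands you a better one */
unsigned long hash(const char *s) {
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* walk the chain at the hashed slot; only a full miss means "not here" */
int in_dict(const char *key) {
    unsigned long index = hash(key) % MAX_SLOTS;
    for (dnode *cur = table[index]; cur != NULL; cur = cur->next)
        if (strcmp(cur->key, key) == 0)
            return 1;
    return 0;
}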

You will implement a hash table for this lab, so you’ll get to experience its inner workings first hand.

Fancier hash tables in languages like Java or Python actually dynamically adjust in size. When they see that the number of collisions has reached a certain threshold, they’ll increase their size from 1000 to 2000 (or whatever the size is) and then re-hash all their elements. This is naturally a rather time intensive operation, but the amortized cost is quite small (algorithms, people!).

Great, so now you’ve designed your data structures and the methods that go along with them, and you’ve unit tested all of them, so you know the error cannot be within any of those files.

Awesome! If you’ve truly done all of these things, then the actual crawler.c file just has to be a single main function - in fact, it doesn’t have to be more than 25 lines of code!

1. Create all the data structures you'll use - your dictionary, your queue, etc.
2. As long as your queue is not empty:
    3. Pop off the first element from the queue.
    4. Download the contents of the page the URL points to.
    5. Search through all the URLs in this page, adding each to the dictionary / queue if it's not already there and it's not beyond MAXDEPTH.
    6. Store the depth of this URL, and all the URLs you found in it, in a file.
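
In actual C, that outline becomes a main roughly like the sketch below, reusing the queue and in_dict sketches from earlier. Every other helper here (download_page, get_next_url, dict_add, save_page) is hypothetical - substitute whatever your own modules provide:

#include <stdlib.h>

#define MAXDEPTH 3   /* made up - use whatever your lab specifies */

/* hypothetical helpers - substitute your own modules' functions */
char *download_page(const char *url);
char *get_next_url(const char *page, int *pos);
int in_dict(const char *url);    /* from the dictionary sketch above */
void dict_add(const char *url);
void save_page(const char *url, int depth, const char *page);

int main(int argc, char *argv[]) {
    if (argc < 2)
        return 1;

    queue q = {0};               /* the linked-list queue sketched earlier */
    dict_add(argv[1]);
    enqueue(&q, argv[1], 0);     /* step 1: the seed enters at depth 0 */

    urlnode *node;
    while ((node = dequeue(&q)) != NULL) {            /* step 2 */
        char *page = download_page(node->url);        /* steps 3-4 */
        save_page(node->url, node->depth, page);      /* step 6 */
        if (node->depth < MAXDEPTH) {
            int pos = 0;
            char *url;
            /* step 5: enqueue each unseen URL at depth + 1 */
            while ((url = get_next_url(page, &pos)) != NULL) {
                if (!in_dict(url)) {
                    dict_add(url);
                    enqueue(&q, url, node->depth + 1);
                }
            }
        }
        free(page);
        free(node->url);
        free(node);
    }
    return 0;
}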

That’s it! There’s the lab. Go get ’em!