Sometimes a server is overloaded, and even a perfectly valid URL may come back as a failure from wget. For this case, good practice (which you should implement) is to try several times with a short sleep in between; the sleep simply gives the server time to return to a normal state. My code tries three times, sleeping two seconds between attempts.
After reading the lab assignment (seriously, go read it. Twice. Thrice.), you’ll find out that the data structures required for this lab are very similar to the ones used in Crawler, albeit with slight modifications.
Some students have already remarked on this on Piazza. In case the assignment was not clear enough, I’ll say it once more here: this is precisely why we need to refactor our code. Below is an exchange I had with one such student on Piazza.
lab5 needs a dictionary, a word node list, and a document list for each word node. I am asking whether the DICTIONARY struct and DNODE struct defined in crawler can be reused in indexer. I guess we don’t have to define a new WordNode, which would have essentially the same elements as DNODE. But we do need to define a new document node struct, because document nodes are linked together, unlike the URL nodes.
And here is my response.
You’re correct that they have essentially the same elements. :) Hence, we have to refactor our code.
Any time you see that you’re using basically the same thing with very slight modifications, you have to wonder if you can refactor your code to more efficiently use the same thing for all purposes.
So you’ll need to change / refactor old code such that you create a generic Node structure.
But how can that work? Remember that in C, a void pointer can point to anything; and since the node’s data field is always a void pointer… you see where I’m going with this.
So specifically, you need to develop a single node struct that can be used for the entire search engine, and get your crawler working once again with this new structure breakdown.
If you’re thinking this is a lot of work, you’re right. Refactoring code is not an easy process, and it should be done as soon as possible, before the code gets unwieldy.
This should clarify what we mean by refactoring your code. Make your data structures that were highly specialized for crawler into generic ones that can be used across the entire search engine project. Then, get your crawler working once more with the generic data structures.
Does this mean I need to re-test my crawler? Of course. That’s why you made a testing script for it.
I’ve changed the directory where you should be testing your crawler. Previously, my website redirected you to a custom 404 page that I set up; this, unfortunately, is not kosher with wget. So I’ve moved the test pages to a different directory where this problem will not occur. The URL is http://chanderramesh.com/crawlerTesting.
You don’t have to turn in crawler again for this lab. So if you’re feeling especially strapped for time, just turn in a refactored version of the indexer and refactor crawler later. Note that this is a requirement for the next lab, so you’ll have to do it at some point…
Sometime this weekend I will put up a tarball of some sample web pages you can download, along with the results you should get when running your indexer on them. To test your code, run the indexer on the web pages from the tarball, then run diff between your results and mine. If diff outputs nothing, then they are the same, and your indexer likely works!
All testing material here has been deprecated. See the Lab 6 announcement for how to test all parts of your Tiny Search Engine.