Recitation 6

Welcome to the final recitation section! How quickly time flew! I will miss you all very dearly.

But before the goodbyes, we need to discuss two labs - the Query Engine and the sockets lab. First, as I’m sure you’ve heard, the sockets lab is optional. Why would you want to do it?

Of course, if you’re interested in none of these things, then feel free to skip the section about the final lab.

Query Engine

For this lab, you will have to turn in all portions of your Tiny Search Engine. This is the final package, so I expect all parts to be there and polished completely. This means

None of these are optional. Even unit tests. I expect a well documented README explaining how to run your unit tests, how to build your project, etc.

Once you get past all the other fluff requirements like refactoring, the actual deliverables for the Query Engine section of the Tiny Search Engine isn’t too bad. Essentially, you’re developing a rudimentary querying language. This last step is how Google makes all of its money - it matches the best by relevance, giving the best signal to noise ratio. But given that we only have a week, we’re doing to develop a rather crude algorithm.

This algorithm just has two operators: AND and OR. OR has higher precedence than AND. Feel free to develop past this requirement, but beware that it must meet at least this requirement. For full examples, see the code he’s given as examples.

This lab is essentially just testing your ability to code logically and use string manipulation functions. Not too bad after the indexer.

Process Vs Thread

What’s a process and what’s a thread? What’s the difference between the two?

As it turns out, everything. Recall the virtual memory diagram from the last recitation section.

This is not the virtual memory for your entire computer. Rather, this is a diagram representing the virtual memory for a single process. Each process has its own stack that looks much like this. That is, each process has its own stack, its own heap, its own data, and its own text (or code). This is important, so I’ll say it again. Each process has its own stack, its own heap, its own globals, and its own code.

A thread, on the other hand, is always associated to a particular process. It shares the heap, data, and code with the process to which it is attached. Instead, it has its own stack. That’s it.

So what are the benefits of threads? Remember that a thread has its own stack, but that’s it. It shares everything else.

So overall, threads are just more efficient. No two ways about it; no matter how you optimize your operating system, threads are always more efficient than processes. What are some of the disadvantages of threads?

The second point is rather elusive and is not necessary for this class, so I’ll be brief about it.

Threads are often implemented by using by a single core switching back and forth between threads, whereas processes can be fanned out to multiple cores – this is true parallelism. Is this faster? Well, depends on whether you are I/O bound or CPU bound, doesn’t it? This is why there is a raging debate as to how to resolve issues like Global Interpreter Lock. Because of this limitation of threads, interpreted languages like Python and Ruby don’t actually get faster when it comes to using threads, so you’re forced to use processes.

Race Conditions

If you didn’t understand that, don’t worry about it. For those of you who are even more curious, come talk to me about it later. The more important point is the first one: race conditions, otherwise known as a data race.

The fundamental question here is, since threads share the same heap and data, what happens when two threads try and access the same global variable at the same time? For example, thread A is trying to read from the memory, but at the same time thread B is going to write to it. Uh oh… That is a data race.

The point of race conditions is that you get non-deterministic outcomes. You will never be guaranteed that this is precisely the order in which threads will alternate because that’s up to the operating system and not you. So how can we ensure that we don’t accidentally mess with the same data?

Enter the mutex. A mutex, or a lock, is something that can only be acquired once; once acquired, it is considered to be “locked”. Upon releasing the mutex, it is no longer “locked”. So how do we use this?

We need a notion of a critical section here. A critical section is the segment of code where a thread is accessing common data that may be reused by other threads. So this may be the global data structure or what have you. The paradigm for such access is the following:

  1. Acquire the lock.
  2. Make changes to the critical section.
  3. Release the lock.

Some of you have taken algorithms so you are familiar with loop invariants. For those who have not, a loop invariant is a set of conditions that must always be true at the beginning of every loop iteration. The conditions may be violated during the loop, but at the end they must be restored to true once more. This is a very similar principle.

Before any thread accesses the critical sections, the invariants must be true that no other thread is accessing that section at that moment. If it’s not true, the thread simply waits to get the lock.

What happens when a thread forgets to release a lock? Uh oh… All the other threads will eventually just hang, waiting for a lock they can never acquire! The program will hang infinitely. This is known as deadlock.

There are all sorts of other nasty situations, like starvation or overfeeding, that can occur if you do not properly follow the paradigm set out for critical sections.

As a side note, there are other types of locks, but we won’t discuss them here. If you’re curious, there are two other popular ones known as semaphores and condition variables.

Nomenclature

One tiny note on language that is my personal pet peeve. The notion of “fork” only applies to processes. You fork a process; you do not fork a thread. You spawn threads, or create threads, or do any number of other things with threads, but you don’t fork them.

Sockets

I don’t have a whole lot to say about this last lab; it’s quite easy, and you’ve reached the stage in your CS career where I’m not going to give you psuedo-code outlines for labs. Read the lectures on processes vs threads and the sockets lecture as well as the lab assignment. You should have what you need from that. This lab truly isn’t so difficult.

One conceptual sticking point is the difference between pipes and sockets. Here is a great resource for this. Here is the sparknotes version:

A major difference between pipes and sockets is that pipes require a common parent process to set up the communications channel. A connection between sockets can be set up by two unrelated processes, possibly residing on different machines.

Because of this difference, you’ll find sockets being used for internet communication (where there are no common parent processes to set this up) while inter-process communication (generally on the same computer) is done using pipes.

Less important but still relavent are the large performance differences depending on OS and implementation.