(a very big nutshell)
As much as possible, I'm going to have to try and trim this down, but there's a limit to how much one can simplify the concept of Global Illumination (GI). I will try to work from more basic to more complicated, but in practice, it isn't quite fair to say that. Personally, I find path tracing simpler than raytracing even though it qualifies as a more advanced topic, but maybe that's just me. I'm going to assume for now that everyone who'd actually read this has somewhat of an idea about graphics programming and the math involved. And they're probably familiar with basic lighting computations and at least the basic concepts. Of course, my rhetoric may precede my judgment in how much people might know, but I'll try to make it introductory, anyway.
Let's first consider "what is the goal of illumination models when we're rendering?" As much as one might tell you otherwise, we're not really trying to simulate light as it really works. We're trying to get something that looks good enough to fool the eye. And there are a few wellknown, wellaccepted models to handle these operations. Though far fewer can really handle all sorts of light transport. The first illumination method that we all probably learn is raytracing and/or raycasting. Even the simple lighting we perform on our beloved polygonpushing engines to make them look pretty is really just plain old raycasting, albeit not necessarily performed at every pixel. Raycasting and raytracing are essentially the same, except that raycasting is what is referred to as only "the first hit" in the process of raytracing.
Of course, there's a simple problem with raytracing (or, more accurately, the common Whitted raytracer that we associate with the term); it is simply a solver for direct illumination. That leads us to defining what "Global Illumination" really is. Direct illumination as solved by a Whitted raytracer only solves a single interaction  that between objects and emitters of light. And more than that, the only type of light emitter it handles natively is point light sources. In real life, there's nothing in existence that really emits a significant amount of light from a source that is infinitesimally small. Anything that emits light in extents that we can really see actually takes up some amount of space. We can fake this by simply having a small distribution of point lights occupying an area, but we'd need many of them to get a decent result. This is not to say that simple direct illumination has no place when we're trying to create realistic renders. Pixar's Renderman, for instance, is strictly direct illumination and yet it has consistently produced quite fooling results. But to produce convincingly realistic rendering does mean that however we accomplish it, we need to simulate indirect lighting in some way or another. In old fashioned polygon rendering and simple raytracing, we fake this by using a constant "ambient" lighting term. All is not well with just a flat constant value. If nothing else, it makes a scene look dead. If one were to look at indirect illumination by itself with the direct illumination factor taken out, it can look almost flat in many cases, but there are subtle local variations, and more importantly, hue variations in light sources and objects within a scene are a major contributor to local variations in the indirect illumination.
Note that when I say "fooling the eye", we do have to take into account matters of relative significance. Our eyes are actually sensitive enough to capture light flux as small as 5 individual photons passing through to strike the retina. But this does not mean that we won't fool a viewer until we achieve that level of precision. One of the things our brains do for our vision is adjust to varying light conditions. We most certainly would not be able to see a 5photon flux in broad daylight. And yet even though the sunlight beating down is so bright as to blind us, when we look at objects in daylight, the lighting appears fairly even. Even with all that, we can look at a television screen and look at a comparatively dim and dull image of the sun, but be convinced that it is, in fact bright. This is in spite of the fact that no display technology in the world can come close to covering the entire gamut of human visual perception. And in the end, this is what affects how we reach our goal of fooling the eye. If we were to get too caught up in accurately simulating optics, we'll have results that may be numerically correct but aesthetically wrong.
See the world through the eyes of a polygon...
So what can we do to step up from the direct illumination of a raytracer? Well, perhaps the first thing we might think is "raytracing is no good for indirect lighting, let's look somewhere else." Which takes us to a GI algorithm which is probably both the most underrated and most overrated at the same time  radiosity. Radiosity is a lighting technique that is actually borne out of finite element methods for radiant heat transfer. As a finite element method itself, one would actually break up the scene into very small elements. Radiosity takes a somewhat backwards approach than one might immediately think. Rather than having paths from lightsources to surfaces, radiosity simulates diffuse light transport in terms of visibility between surfaces. i.e. Rather than putting lighting in terms how much light reaches a surface, we put it in terms of how much light the surface can see. Although, to be complete about it, there are "shooting" variants to radiosity, which do show their strengths over collecting types of radiosity in the fact that they affect more surfaces earlier.
A traditional radiosity implementation puts the question of visibility of light in terms of a massive matrix called the form factor matrix. Actually, the term "form factor" refers to individual pairings of surfaces and their respective fractions of light leaving one surface that reaches the other. The "form factor matrix" is simply the matrix that contains the full set of form factors. This matrix is step by step reduced, solving for an eigenvector until an equilibrium solution is reached. That is, there is the same amount of light leaving every surface as there is hitting it. In practice, this is very expensive as the scene gets larger and/or the element sizes are reduced. More importantly, let's not forget that we don't really need to go so far  we only need to get a good enough result that it fools the eye.
The form factor between two surfaces is expressed in the form 
F is the form factor between surfaces i and j. A(i) is the area of surface patch i. "v" is any vector connecting the two surfaces. Phi(i) and Phi(j) are both the angles between v and the respective surface normals. Note that the only thing that affects the transmission intensity is simply the cosines of the angles between light transport direction and the surface normals. This is just the same as Lambert's law which we all know and love. And that is one of the main weak points of radiosity  it only works with lambertian surfaces. Of course, this does give us one little advantage  the lighting we calculate by radiosity is viewindependent, so it only needs to be calculated a single time for a single scene. If we were to take the full set of all pairings and put them into a matrix, what we have is simply 
The "rho" terms are the reflectivity of each surface  in simpler terms, rho(i) is essentially the color of surface i. Some sense of reality does need to be taken into account when choosing these colors. There is no such thing in real life as a perfectly black or a perfectly red or perfectly blue surface. Any surface that looks blue will still reflect some amount red and green. Of course, in practice, an RGB color model often causes the color bleeding from indirect lighting to be overly exaggerated or extremely underrepresented. Serious implementations use spectral color models to remedy this, but regardless, this matrix, if multiplied by the nvector of all the radiosities of the n patches in a scene, will produce a result that is supposed to equal the nvector of the emissivities of all patches in the scene.
Well, there are better, more straightforward ways to compute radiosity, and generally, the most common approach (or set of approaches) is progressive refinement. This is where a simple approximation of lighting is made throughout a scene, and then all lightsources that were used in this approximation are taken out of consideration, and the lighting approximation is treated as a set of lightsources by which to compute another lighting approximation. The new lighting approximation then gets used as a set of new lightsources, and the process repeats itself. What is happening is that we end up iteratively improving the result and converging ever slowly towards an ideal solution. Of course, we can really stop plenty of iterations short of complete convergence and still have a result that looks good, and that's the main reason progressive refinement solutions are preferred.
The one type of radiosity form factor computation method that probably gets mentioned most is hemicubes. A hemicube is exactly what it sounds like; half of a cube. Really, it's no different if we use a hemicube or a hemisphere. The advantage of using a cube is the fact that when we render the scene from a patch's point of view, we can use the normal linear camera model we're so used to and maintain a certain, *almost* constant level of detail no matter what direction we sample in. With hemicubes, we simply render the scene from the viewpoint of every patch. Any sources of light, whether it is from an actual lightsource or a light reflecting off of a surface in the scene, will contribute to the lighting of the patch. The image we see is weighted in accordance to the relative cosines and relative distances of every pixel on the hemicube (this weighting is the same for any hemicube). This weighting not only corrects for the direction of incoming light, but also corrects for the "flatness" of the hemicube shape, which causes some distortion in the sampled image. The sum of these weighted pixel values throughout the image is the amount of light we see at this patch  it makes for a very simple approximation of the doubleintegral above.
1 iteration 
32 iterations 
320 iterations 
After one iteration, the results look pretty much like ordinary direct lighting. That's to be expected since only the surfaces that could see the emitting area light in the ceiling receive any light in the first pass. After 32 iterations, we see some color bleeding as a result of light reflecting off of the red and green walls. In addition, the previously gray walls look a little closer to white. The ceiling, which previously was dark, is now visible as a result of diffuse interreflections from other walls. Also, the yellow box is more visible on all sides, and its shadow has softened. However, with only 32 iterations, the edges and corners seem to pose problems. This is because they are immediately adjacent to other patches that would pretty much block out entire faces of the hemicube, thus preventing you from getting a reasonable sampling of everything else in the environment, but given enough iterations, it can selfheal. Also, at 1 as well as 32 iterations, we see some noise near corners and shadows on the floor, as well as some banding on the vertical walls. This is due to the fact that hemicubes can take up a good deal of memory, so in general, they are very low resolution. Low resolution implies that you are subject to a good deal of aliasing artifacts. Many of these problems disappeared on their own after more iterations, but as a rule, it's not good to depend on luck and apply some filtering to your hemicube input [Hugo Elias writes of an excellent hemicube filtering technique that not only reduces overall computation time, but doesn't affect the overall energy input from a sample], and perhaps also some smoothing passes on the patches themselves. In this particular scene, the reflectivities of the surfaces were sufficiently large (and the scene itself is sufficiently small) that a mere 320 iterations is fairly effective. There are some shadowing artifacts still remaining, which are a result of the model itself  the colored walls are separate planes that clip into the existing box, which has had faces deleted. Even so, JPEG artifacts aside, the result after 320 iterations, is quite good (for a full resolution PNG file, click here). In a normal scene without very strong lighting, though, one should really expect several more iterations to be necessary before a good result is apparent. On a side note, I would like to mention that if the images look dark to you, it's probably because they've been gammaadjusted for a setting of 2.20. I rendered and viewed these on an LCD screen of pretty high contrast ratio.
Many people often start out in their quests to solve GI by implementing radiosity. Why? Because it's quite simple to take the results, apply it to a very dense polygon mesh and then render it in a simple OpenGL or D3D render window and then view the result from any viewpoint. I did say earlier, however, that radiosity only handles Lambertian surfaces. This isn't quite the whole truth. The fact is that what it handles is light being reflected, absorbed, or scattered equally with respect to all directions. In the case of surface properties, this means Lambertian surfaces. But radiosity can also work with volumes and participating media. Again, it does have to be an equalscattering medium, or what is referred to as an isotropically scattering medium. Note that with a surface, it is assumed that light can only be reflected or absorbed, which limits the scattering range to a hemisphere (or hemicube) about the normal to the surface. In a volume, it is possible for light to pass straight through without being affected at all, or only be slightly scattered. Since we're talking about an isotropic volume, we know that all outcomes are equally likely. This means that for volumes, we don't sample form factors with hemicubes, but whole cubes. Also, there are variations on ordinary radiosity that allow for speculardiffuse interactions, but specularspecular is a bit more of an issue. Some of the earliest global illumination solvers actually used raytracing to solve specular components and radiosity to handle diffuse components. Even with that, there is still not enough to say that we have a very comprehensive global illumination solver.
Those of us who have walked into graphics starting out with realtime rendering techniques and knowing little of BRDFs and BSDFs outside of Phong and Blinn models would probably have been quite taken aback at the sight of a double area integral earlier on. Eeep!! What the hell is this?!? Scary math!! Frankly, we're just getting started. The reason this scary math even exists is because GI in itself is the process of evaluating an integral. Just wrap your mind around that for a moment. How is that the case? Well, first think about the term "global illumination". Global, as in, from everywhere of every type, in every possible way. In short, there is light bouncing around every which way, and so no matter where you are in the world, there is light coming into that point from every conceivable direction. Now in a perfect world where we have infinite memory and infinite processor power, we could gather information about all the light in every direction and add it all up including not only direct lighting, but also light that has bounced around an infinite number of times... and every other source of light in between.
Well, in practice, we don't have infinite memory or infinite processor power, so we have no choice. First of all, we have to limit the number of bounces we can consider (and only a few are really necessary in most cases anyway). We can't really sample lighting in *every* direction. We can, however, sample lighting information in many, many directions and come close to the same result. The idea being that the more directions we sample along, the tighter the "grid" of solid angles which we use to evaluate the total set of all lighting. Well, if we're trying to sum values of a function which is evaluated at several sample points... and we're trying to space those samples closer and closer together and take more and more total samples... What do you know? That's the same as plain old numerical integration.
However, numerical integration is hellacious in terms of computational resources. It would take forever if we tried to evaluate actual integrals. There is something to save us, though  Monte Carlo integration. It may sound surprising, but Monte Carlo methods allow us to transform problems of numerical quadrature into statistical problems  more specifically, numerical integration will become an expected value problem. This may sound somewhat surprising, but it's true. So before I go on to explain Monte Carlo integration in a rendering context, let me first try and give you an example in a simple integration problem.
Let's consider this example case. It doesn't matter what f(x) is as long as we can evaluate it. It doesn't matter what the value of "a" is as long as it's larger than the highest value reached by the function in the domain we're trying to integrate within. In this case, the domain we're integrating along is x = 0 to b. What we've done is outlined a box where x is ranged [0..b], and y is ranged [0..a]. Now with normal numerical quadrature, I'd make a whole bunch of thin vertical rectangles or trapezoids or some other figure with vertical sides going straight up from the xaxis to whatever the curves value is at that point. I'd compute the area of all these really thin rectangles or whatever they are and just sum them up to get an approximation of the area underneath f(x) from x = 0..b. If I want a really good approximation, I'd use really thin rectangles.
Now for the Monte Carlo solution. Consider that box we've marked out. Now we know the area of that box. It's simply a * b. How does that help? Well, let's say I pick a point p within that box. Any point, doesn't matter where as long as it's inside the box. What's the probability that p is underneath the curve? Well, we know that there exists some area, S, though we don't yet know how much yet, underneath the curve. And similarly, there is it's converse, T, which is the amount of area above the curve within that box. Okay, we don't know what S and T are, but we do know this. The area under the curve plus the area above the curve should be equal to all the area that exists in the domain (the entire area of the box). That is, S+T = a*b. So out of all that area, a*b, the area under the curve occupies a fraction of it which is equal to S / (a * b). That's the answer to our probability question for p... What chance do we have of selecting a random point underneath the curve within the constraints of that arbitrary box? Simply, (area under the curve) / (area of the box). And this is the same probability for every point that we might want to check, since they're all essentially independent attempts.
As I said, though, Monte Carlo integration is an expected value problem. So what we do is select a whole bunch of randomly selected points inside that box. Some of them will be above the curve, and some will be below. With enough iterations, we should expect that (# of points under curve) / (total # of points tested) = (area under the curve) / (area of the box)... or some reasonable approximation thereof. Sounds pretty neat? It is. Those of us who weren't really fans of statistics are probably going to think that selecting points at random is slow and doesn't make much sense. In theory, you could just sample on a regular grid and it would be faster, and make a lot more sense. Well, that's fine for an integration problem as small as this one. Where Monte Carlo sampling really shows its strength is in its scaling. When moving up to larger problems and higher dimensional problems, random sampling gives you better results with fewer samples. Even more important than the scaling is the fact that by using random sampling, we actually decrease the risk of losing information. One, because structures themselves can be irregular, and two, because a regular grid is fixed, any information that you miss will always be missed. The problem with a perfect grid is that, you end up with axiallybiased results  and this bias does show itself in the form of aliasing artifacts. This same problem showed itself with hemicube form factors at a low resolution. As you might expect, though, random sampling is NOT biased. Which is why we love it so much.
While I'm on the subject of speed with Monte Carlo integration, I should bring up one more thing. When we think of random in computer world, we really think pseudorandom number generation. And in fact, for Monte Carlo methods, a PRNG isn't really the best choice. In fact, pseudorandom numbers are actually... a little too random.
On the left is a scatter plot of sample points from a pseudorandom number generator. On the right is the same number of sample points generated by a quasirandom number generator. Notice that with the PRNG, we have many regions where sample points "clump" together, and many areas where sampling is very sparse. This would cause some undersampling in some areas and oversampling in areas where it may not be needed. By contrast, the QRNG sample space is fairly even all over the box, and almost every local region throughout the sample space is wellcovered. It should be quite obvious from inspection alone that using a QRNG over a PRNG provides a far better convergence rate for Monte Carlo methods (1/N convergence as opposed to 1/sqrt(N)).
The sight of that QRNG scatter plot should hopefully light a bulb in your head. It looks almost like a grid, but it's uneven. Actually, the popular term is a grid that's been "jittered." While it isn't really a jittered grid, jittered grid sampling or "stratified" sampling is also an equally popular sampling method, and often a lot faster to generate than QRNs... On the other hand, a spherical grid is easy to generate and any old randomnumber generator can be used to jitter that grid. Although, again, the preferred grid generator would be a geodesic grid, to prevent oversampling at the poles.
So how does this work in a rendering context? Well, the first difference is that we don't have a simple yes/no question like we had with the 2d area example. Instead, we have lighting in so many places and in so many directions, and in each of these directions, we're dealing with amounts of lighting. So we're actually sampling values that are eventually summed together. Notably, we're not performing a simple 1d integration to compute area, but actually integrating over a surface area in 3d space (and maybe even in 4d, although, that's a later topic). The other big difference is that there is known importance to certain sampling directions. We know that light striking a surface at very sheer angles is not likely to have significant effect unless the source is of very high intensity. And that raises the third difference  the function we're integrating is not explicitly known, which is why we use the word "sampling" as opposed to "evaluating".
Just because a basic Whitted raytracer didn't solve anything but direct lighting, doesn't mean that we must immediately dismiss raytracing altogether. The thing about raytracing is that is based on singular unidirectional tracing of a single ray bounced around the scene for each and every pixel. Yes, I know, there's multiple rays per pixel if we take antialiasing into account, but that is essentially not all that different from having 1 ray per pixel in a very high resolution image. But that's what Monte Carlo methods are for. Rather than considering 1 ray bounce at every intersection between a ray and a surface, we consider a whole bunch. A whole... quasirandom... importanceweighted bunch. And now, we have a new name for it  Distribution Ray Tracing. In older literature, you might have also heard it referred to as "Distributed Ray Tracing", but the only thing I don't like about that name is that it sound like a ray tracing implementation for a distributed computing platform. In other cases, it might also simply be called "Monte Carlo Ray Tracing" or "Stochastic Ray Tracing."
Anyway, if we think about it, distribution raytracing and hemicube radiosity (or one pass of each) aren't all that different. Hemicube radiosity is much like the unrolling recursion phase of distribution raytracing  when all the results we've calculated come back to the point we're trying to solve for and get summed together. The difference is that we're not assuming a single simple surface model. Now that we have control over the sampling ranges, we can distribute them according to an arbitrary BRDF for the surface. And this means we have a nice simple solution for all kinds of light transport. Yes, indeedy. And if you'd read my weighted color accumulation article, you'd also realize that if you don't want to try and complicate your implementation with arbitrary sampling distributions, you can instead weight your samples and simply reject samples that are too low in weight.
The diagram on the left should give you the general idea. We start from the camera... Shoot a ray from the eye through some pixel. It strikes a surface, and that surface has some BRDF to it, which means that we have some distribution of rays to sample along. The sum of the lighting computations that we acquired through all these rays will be the result for this pixel. Anyway, some of these rays we shot out will strike other surfaces, some of them will do nothing, and some of them may even go straight to a lightsource (which isn't really in the diagram). Those rays that struck another surface? Well, they spawn their own distribution of extra rays bouncing around the scene, which can in turn spawn another set of rays. As you can probably guess, this can ultimately amount to a lot of rays through the scene.
Those gray lobes represent our BRDFweighted range of ray distribution (although, in reality, lobes caused by specularity would be centered about the reflection vector). Depending on how finely specular vs. diffuse a surface is, this lobe may be tighter. For instance, a glossy surface would have a mediumwidth lobe to it. A diffuse surface would have a perfect hemisphere for its lobe. A perfect mirror would have no lobe at all (or more accurately, a lobe so tight that all samples go in one direction). Anisotropically specular surfaces would have a lobe where the width may be wider in one axis than in another.
 On a side note, for diffuse surfaces, we often like to have cosineweighted distributions rather than perfectly even spherical around the normal because sheerangled light is assumed not to contribute very much. A simple way to generate these at random is to first generate an arbitrary random vector (which can be in any direction)... Normalize this vector so it is now a random direction on the unit sphere. Then add the normal vector of the surface, and renormalize the resulting vector. The only degenerate case would be those where the initiallygenerated random vector on the unit sphere happens to be close to or equal to the opposite of the normal vector. Those cases can just be thrown out for a repetition. 
Interestingly enough, this simple, albeit very computationally costly, variant on raytracing can model all types of light transport and can handle both area and point lightsources. Obviously, though, some types of transport take more effort than others. Glossy surfaces don't require very many sample directions at all to get a good result. A lambertian surface, otoh, just might require a lot depending on the scene. A mirror, of course, only requires one sample, but a foggy mirror is another matter. Then of course, there's caustics which, in general, require hideously many samples to get good results. Normally, refraction only involves 2 samples since it spawns a refracted ray and a reflected ray, but that will change if one intends to implement wavelengthdependent refraction. While it does involve a little more work than normal to handle participating media and volumetric scattering, it is possible to handle this as well. But again, while participating media does not require very many rays at each sampling location in the volume, a great number of samples need be made to get a good result (i.e. *very* high recursion depth), which in turn means that many total rays are sampled in the scene. In practice, though, nobody likes to take the exhaustive route to participating media and instead use bidirectional methods, which I'll get into later.
Even more interesting is the fact that the distribution doesn't just have to be going through space in accordance with the BRDF. For instance, depthoffield effects are generally achieved by starting with a distribution across an imaginary lens. Yet another trick we can play with is motion blur. In a scene with a moving camera or moving objects and so on, we can achieve an excellent motion blur effect by not only distributing the rays in space, but also in time. Randomly selected times also mean that objects are in a different position relative to the viewer which means that the same directions will behave differently. As you can probably expect, though, it's infinitely slower to do all of these things in addition to already monte carlo sampling in space at every intersection. That's just part of life.
Well, it sounds nice and all in spite of the very slow render times that plague the process... but there is one far larger problem that plagues Monte Carlo methods including distribution raytracing  noise. Because of the fact that we can't guarantee that adjacent pixels will sample the similar regions of the scene with similar weighting, and so there is always a little bit of variation. The obvious answer is to simply sample a lot more directions at every intersection and perhaps also recurse further. If you can afford the extra time to render, then by all means, this is the best approach. But the thing to watch out for is the fact that this noise, while it may be glaringly obvious to the eye, is often actually very small numerically. And with very small numerical differences, that most likely means that your sampling coverage is already such that the convergence is pretty far along, and will therefore be in a range of very slow convergence rate. The earliest samples (i.e., the primary rays) make the biggest difference. There are also things we can do to the actual image to help smooth the result. Similarly, the samples themselves can not only be directionally biased according to the BRDF, but also be weighted when averaging the results so that every pixel has a somewhat more even sampling with respect to direction  sort of cheating away the unevenness. In the end though, distribution raytracing, while it can do just about everything, the real future will lie in better performing algorithms.
Ummm... well, that may be pushing it, but it was still a really nice and useful paper, albeit not completely embraced in a hurry by the academic community. I'm referring to a classic written by James T. Kajiya, which appeared in the SIGGRAPH proceedings in 1986. The famous "Rendering Equation" paper. The rendering equation is not so much a magic universal equation that handles all types of rendering, but rather a metric of rendering schemes. It's what rendering schemes aspire to solve... It does not innately handle all optical phenomena, but rather, is written and used under the assumption that said optical phenomena are modeled by the algorithms. In short, we use it to measure up graphics models against each other.
The rendering equation, using Kajiya's notation is expressed as follows 
I(x,x') is the intensity of light transported between the arbitrary points x and x'.
g(x,x') is the visibility between x and x'. If the points are occluded from eachother's view, then this function will be zerovalued. Otherwise, it will vary with the inverse square of the distance between the two points.
epsilon(x,x') is the total potential amount of light transfer emittance from point x' to point x.
rho(x,x',x") is the scattering term or the BRDF/BSDF term with respect to x' and x". It is the intensity of light scattered towards point x from point x' which arrived from point (or along the direction) x". That is, if there was light reaching x' from x", how much of that light reaches x?
S is quite simply, the scene. The integration taking place over the space of the scene means that we are evaluating this for all points on all surfaces in the scene... or we'd like to, anyway.
If that all looks confusing, I'll simplify  What this equation says is that the light transport intensity from x' to x is equal to the sum of the light emitted from x' towards x and the total light coming from everywhere else in the scene that was scattered, reflected, or emitted towards point x', and then from x' to x. And in fact, radiosity, for instance, can be derived from this by reducing to all cases where reflectivity and/or scattering is a constant.
In Kajiya's paper, aside from the equation itself, he brings up a new method for solving the rendering equation  what he called "Path Tracing". If we consider all that sampling we do with Monte Carlo methods as analogous to "searching" through the entire space of light bounces that might have occurred, then we have an interesting comparison here. Distribution raytracing can be thought of in terms of a breadthfirst search of illumination space. And this makes sense  at every bounce we're sampling in several directions; not unlike a search tree that branches widely at each recursion depth. Path tracing, on the other hand, can be thought of much like a depthfirst search... almost, anyway. Kajiya describes it with the following diagram 
The digram on the left is representative of distribution raytracing. And the one on the right is indicative of path tracing. Note that the selection of which path is really arbitrary. And this is where I really find the simplicity in path tracing  we just select any old path to sample along. Okay, I'm going overboard, yes... We can't forget that we are sampling  we still have to remember that we can save a lot of time and unnecessary effort by importance sampling.
Monte Carlo Path tracing, in spite of everything else you might hear, is still pretty much the premier, stateoftheart rendering scheme out there (albeit subject to many further optimizations beyond the normal. While Jensen's photon mapping is a great performer that produces great looking results, scaling to larger scenes will put it noticeably behind MCPT in both performance, results, and especially memory usage (of course, in Jensen's implementation, a distribution raytracer is used in the final gathering pass, but a path tracer can be used as well). This is the case primarily because path tracing, like ray tracing, begins as a pixelscanning process, as opposed to photon mapping, which works in 3d space. In the process of MCPT, the first hit is really handled pretty much the same as with distribution ray tracing in that we randomly select directions to sample for indirect lighting. But that's where the similarities end. Remember that this is path tracing. In a pure path tracer, we'd only consider one stochastically selected sampling direction at any point (including the first hit) but shoot several initial rays through each pixel, averaging all the results. In a more generalized Monte Carlo path tracer meant to handle more situations of light transport, we shoot fewer intial rays (enough to get some decent antialiasing), but sample several stochastically selected directions at the first hit for indirect lighting. The main idea behind path tracing is that secondary rays have less effect than primaries, and the further rays have even less. So the difference is that as those random samples intersect other points in a scene, we only consider its direct lighting and one stochastic sampling path from that point.
This means that when sampling, we consider two ray types. The "shadow rays" and the "indirect rays." The shadow ray is the one with which, direct lighting for that sample point is computed. This ray is sent to a randomly selected point on an area light surface and then treated as if a point light. If the light happens to be a point light, then of course, the shadow ray is a direct vector to that point. If occluded or refracted away from the light, then we receive no direct lighting at that point, but could receive indirect lighting. And that's what those "indirect rays" are for. We send out quasirandom or jitteredgrid rays to bounce around the scene, and at each bounce we consider that point's direct lighting (with shadowing) and one indirect ray and the overall sum of inputs from all these paths is the indirect lighting.
Perhaps another visualization is in order.
Okay, so... trying to explain from the diagram. From the camera, we shoot a ray through a pixel... or rather, some part of a pixel. This is only a single sampling iteration, and we would actually consider a few stochastically chosen rays through the space of each pixel in order to prevent aliasing. That ray we've shot out hits a surface at point y1. At point y1, we calculate direct illumination using whatever BRDF that surface happens to have.
Next we shoot a ray in a stochastically chosen direction from y1. Now as I said, we would actually choose many of these at y1 and only at y1 because it's our first hit. For the sake of clarity, though, we'll continue through the recursion of only one of these paths. This random ray happens to hit some other surface at point y2. So we calculate direct lighting at y2, checking if the ray to the light is occluded by anything. Since this ray we shot out is an indirect lighting sample, its actual lighting contribution with respect to y1 will be dependent on the BRDF of the surface we hit at y1.
Then from y2, we select another random sample... this time only one. And like we did before, this random ray will probably strike some other surface (at y3, in this case). And as expected, we calculate direct lighting with shadowing at y3. And as expected, we can keep going recursing as far as we feel like doing direct lighting at point after point of rayhit. When we feel like stopping, the very last sample will be just direct lighting.
What we're essentially doing is tracing through the path of photons that get reflected to the eye through the pixel y0. Starting from the eye instead of from the lightsources, really means we're "backtracking" along that path. At each point, we're considering both the inscattering and adding them together to get an estimate of how much light would get scattered from the surface along the direction we came from. Since each surface has its own BRDF or BSDF, there is only so much probability that scattering can take place in a particular direction. This goes not only for inscattering, but outscattering as well. For example, the sample ray shot from y2 to y3 has a certain probability of actually being selected in the first place. This is how we weight the inscattering. The further inscattering from y2 to y1 is also weighted in that respect. However, the probability that the direct lighting at y2 will scatter to y1 also needs to be taken into account. That's the outscattering factor, and it's simply handled by scaling down the direct lighting factor according to its relative probability. Of course, actually computing this weighted direct lighting is trivial in the sense that it would be the same as computing direct lighting at y2 as if y1 is the eye. I only bring it up in this manner because it is a more accurate way of looking at the problem.
In a purely diffuse scene, like you'd simulate with radiosity, the computations are quite simple. Every sampling direction is cosine weighted and every scattering direction is of equal probability. At any given point, the outscattering factor will always be 1/pi.
I'm sure it still sounds pretty slow. And yes indeed, it is slow. This is not realtime rendering here, but at the very least, there is a lot we can do to make it faster... and I'll try to get into that, too. It's certainly faster than distribution raytracing simply because you're not sampling countless rays at each sample point. And as mentioned before, it scales very well with the scene size and complexity.
In fact, in the very most straightforward implementations of path tracing, you would randomly decide whether to shoot an indirect ray or perform direct lighting when you sample (and of course, stop and unroll recursion when you happen to decide on direct lighting). What I've actually brought up is "path tracing with direct lighting", which is one of the things that can improve the convergence rate of path tracing with fewer total samples per pixel. If we refer back to the rendering equation, you'll see that there is an emittance term as well as a scattering term within the integral. The emittance term is our direct lighting factor. And the scattering term is the indirect lighting factor. Notably, in the integral is another instance of I, which in turn has its own emittance and scattering terms to take into account. Raw path tracing accounts for both because of the fact that the recursing can stop randomly.
And for the part you've all been waiting for... a sample path tracing image 
This is one of the reference images Henrik Jensen used to compare his photon mapping results with path tracing results. This particular image was apparently rendered with 1000 samples per pixel, although it wasn't made clear how many samples were initial eye rays through the pixels and how many were stochastic indirect lighting samples at the first hit. Notice there is some noise throughout the image. While this is actually a pretty noticeable amount of noise, it's not insurmountable with some good image postprocessing either way. As you might expect, the ceiling has the most noise of any other part of the image, and this is because it receives no direct lighting. All of its illumination is indirect and so the smoothness of the result is dependent on convergence of the indirect sampling to produce smooth results.
Knowing what's really important...
There's plenty we can do to speed up Monte Carlo methods. Probably the first one that people want to implement (although most likely the most involved) is importance sampling. The idea is simply to direct more sample rays towards directions that will make a bigger difference. What makes it complicated is that it's not always obvious when, where, and what is really important. Now, normally, if you implement BSDFweighted sampling, that is importance sampling, but not all there is to it. BSDFweighted sampling is still good enough and general enough to cover all cases, though, so I (as well as the majority of others) don't advise going further unless performance is really important to you. There is such a thing, though, as specializing the importance to the scope of a scene  or scenelevel importance(as opposed to pointlevel importance). In fact, this is one of the places where photon mapping can be used to accelerate other Monte Carlo methods. If you look back at the Jensen sample pic, you'll notice that for the most part the lighting and shadows are smooth everywhere. The exceptions are those two caustics on the floor and blue wall. Which suggests that in areas near the objects that cause caustics (refractive and reflective), we need to sample towards those objects more than we normally might. Caustics are a highfrequency lighting phenomenon for the most part, so tight sampling is necessary. We could cover for it by just taking a lot of samples at the first hit, but that would be too costly. Of course, then that raises the question of when we should even bother biasing our samples. It seems like the larger brighter caustic happens to fall within the umbra of the glass ball's shadow. That's a simple enough test to determine, but then it doesn't explain the small spot caustic on the blue wall, which is actually caused by reflections off the reflective ball that get refracted through the glass ball. In fact, that caustic doesn't even appear to be in the penumbra. The umbra test would prove rather fruitless in this case. For that matter, if you can imagine filling the Cornell Box partway with turbulent water, every point below the water surface would be in the umbra of its shadow, so every point below would need some heavier sampling towards various points on the water surface. Unfortunately, since we're talking about hypothetically filling the box with water, the water surface would occupy the entire width of the box, so how tight the sampling can get is rather questionable. It also doesn't account for the fact that there will also be reflective caustics above the water surface. Similarly, a reflective caustic, like the famous cardioid caustics caused by ring objects is a completely different case. We could say that points within the inner radius would be subject to caustics, but that doesn't really work in the same case if the reflective ring were raised above the ground surface. Again reflective caustics can include more unique cases like a spot of light on a faraway wall that comes from sunlight reflecting off of a watch. It's a lot tougher to say that there is importance in those cases.
In short, importance sampling is one of those things that we can say is quite obvious by human intuition, but not so obvious to quantify. Technically, using BRDF/BSDF weighted sampling covers all the ground that classifies importance sampling since it uses a probability distribution function to control the sampling rather than using a purely unbiased sampling (in most cases, though, BRDF/BSDF weighted sampling is all you really ought to do). In any case, it is much easier to use an unbiased sampling, and moreover, it always takes more total samples to make a biased sampling look like it is unbiased. In reality, a biased sampling relative to a point can still be unbiased relative to the actual scene. This is why, for instance, Photon Mapping separates irradiance, volume, and caustics into separate passes and photon maps. In each pass, photons are exclusively shot to the areas of the scene that need it.
25 samples/pixel 
125 samples/pixel 
625 samples/pixel 
All the images amove have 5 initial eye rays for antialiasing, and so 25 samples/pixel has only 5 indirect sample rays at the first hit. Notably, because of the importance sampling, even the very lowsample count examples still look fairly decent, and caustics are already showing up as early 25 samples per pixel. For the most part, it doesn't really make much of a difference if we made all the 25 sample rays initial eye rays or if all the 25 were secondary samples  the latter will show aliasing artifacts, but that's about the only difference we might see on the whole. The reason I chose, for this scene, to do only 5 eye rays per pixel is simply because I only cared to have enough eye rays to battle aliasing. The other thing was the camera angle was such that you can't really see very much direct lighting, so I simply placed more weight on indirect lighting samples. Either way, I would have gotten a pretty good sampling radius assuming the importance sampling was working nicely. But having them all be eye rays means that every ray has to have the first hit computed for it. With more indirect rays, a large percentage of the rays bouncing through the scene trace back to the same first hit. So if nothing else, I did bring down my render times.
If you want a fullquality render, here you go  1536 samples/pxl, Render Time 27:31 on P4C 2.8 Ghz, with 6 initial eye rays, and 256 secondary samples.
Slow, but satisfying. That tends to be the story of life when dealing with socalled photorealistic rendering. But the fact of the matter is that we'd be a lot worse off in speed if we chose to be completely unbiased. Then of course, we have to hope and pray that our biased sampling doesn't *look* biased. Since importance changes the probability of sampling in a particular direction, it means that samples that randomly fall out of that probability would ignore the biasing. But this means that in order not to appear biased, we need to have the probabilities set pretty much juuust right. Either that, or just use lots of samples.
Aside from importance sampling, there are other techniques you can use to save computation time and improve overall performance. Much as it might be counterintuitive, Russian Roulette is one of those techniques. Russian Roulette is also one of the much easier techniques to implement, and is often advised, especially when working with techniques where more samples not only increases computational cost, but also memory usage (e.g. photon mapping, irradiance caches). Russian Roulette is a method by which you randomly decide whether or not to continue recursing further down a path. It can save you a lot of computation time by reducing the average recursion depth of paths.
It especially proves valuable in photon mapping where you can greatly reduce the number of photons shot and the number of photons stored. Which in turn saves you a great deal of time when balancing the kdtrees and in final gathering.
In general, the most preferred method for Russian Roulette is derived from Eric LaFortune's 1994 paper  "Using the Modified Phong Reflectance Model for Physically Based Rendering." In this one, a random scalar (f) is generated, and compared against the luminance. If f is greater than the sum of diffuse and specular luminance, then the sampling is stopped. If f is greater than diffuse alone, then a sample weighted by importance with the specular model alone is generated. Otherwise, you generate a sample importance weighted by the diffuse model alone. Of course, some basic depth limiting is also applied so that we don't recurse infinitely. The main advantage of this particular method is how much better it integrates with respect to the existing BRDFweighted importance sampling and adapts to local surface conditions. Overall variance is reduced as compared to basic random yes/no.
If unaware of some of the history behind the nomenclature, it can be kind of amusing to notice how all these processes are named after cities. Monte Carlo methods, Manhattan distance, Las Vegas algorithms, etc. And in this section, I'm going to be talking about Metropolis light transport sampling. And in case you're wondering, this is named after Nicholas Constantine Metropolis (who, in fact, coined the term "Monte Carlo" methods, while working on the "Manhattan" Project). Metropolis sampling can be thought of as a further step forward from importance sampling. The idea being that you take some path that you're already sampling and alter it, and if that altered path is important, you keep it as a formal sample, otherwise, you forget it.
Now doesn't that sound easy? Well, that's because you're crazy. It's damn hard! The idea of taking the probability that a path makes a significant contribution and thereby choosing whether or not to keep the path is the easy part. The hard part is developing a good mutation strategy. One of the key points of Monte Carlo methods is that the variance keeps decreasing with more samples, but that's because all the sample directions we look at are independent of each other. In the case of Metropolis, you have a whole lot of correlated samples, which causes higher variance and higher bias. So the important part of this is to choose your mutations judiciously. Bear in mind that you're not changing a path along which to integrate, you're generating a variation of the same path integration.
If your acceptance probabilities are very low on average, then you don't end up keeping many variations except maybe exceptionally long ones, which generally leads to multiple paths affecting the same pixel on the image  what does that look like? Noise. Similarly, we also don't want to have our mutations be too small. Otherwise, we end up oversampling a few lights or a few interreflections. Ideally, we'd want to create mutations that result in, for example, changing from a specular reflection to a diffuse reflection, or an increase or decrease in path length. Let's take an example  We have a path made up of a series of QMC ray samples that happens to go through the points A,B,C,D,and E in the world (A is the camera). Now we do a simple mutation by adding a vertex "z" in the path between C and D. So now the new path is A>B>C>z>D>E. Now if this new variation is accepted, it's another value summed up in the path integral up until point C in the path. Similarly, it's possible to have another point in between points A and B, which results in the same path integral being used for part of the lighting solution for another pixel on screen, if such a path mutation is possible. Other possibilities include bidirectional mutation strategies where subpath sections are joined together to create a new path variation. While we cannot alter path vertices when you are dealing with absolutes like perfectly specular reflection or refraction, we can alter other sections. In a more naive sense, you can attempt to alter a path at locations following a specular reflection, but you'd find that the BSDF yields a single valid scattering direction, and find that the acceptance probability of any variation is absolutely zero, so it gets rejected anyway.
The main thing that makes this modification of paths so fast is the fact that when you go through path tracing steps, is that path integral is known up to each point. At any given point as you start unrolling the recursion down a sample path, you have the path integral computed up to that point, so a variation in the path such as adding or removing or replacing a vertex adds an extra term and/or removes an existing one up to that point in the path. Because you only have a single weighted lighting computation per mutation that you actually keep, it's fairly cheap. In essence, what Metropolis Light Transport is, is an interesting and inexpensive way to take path tracing... reuse paths as much as possible... and turn it into distribution raytracing. In the end, this means that for each path we do try to take, we get more than one effective path actually sampled, and more paths traced means less noise... in theory, anyway. One of the weaknesses is that it still has inherent noise issues in darker regions. The end result of having all these correlated sample paths is that we don't go down in noise as quickly as we'd like.
Diagram time, again 
Now, consider the picture above. We've got our eye, our image plane, some objects and a lightsource. The blue path represents a single path considered in a simple MCPT renderer  like the MCPT w/ direct lighting diagram, we'd probably still be considering direct lighting at these sample points, but for now, I'm concentrating exclusively on the path traced through the world... the last step of the path would be the last point where direct lighting is done, so I included a lastpointtolight vector. The dashed red arrows represent some sample mutations. Let's say our original path is going through the vertices Eye>A>B>Light... now in one of the mutations, we've removed the vertex B. So it becomes Eye>A>Light. In practice, this sort of path would be rejected because it's really just a simple direct lighting sample for the pixel. Even if we were not doing the direct lighting anyway, we'd reject this because the stochastic choosing/not choosing of direct lighting introduces too much highintensity noise  it's much better for the sake of noise to have explicit direct lighting than implicit  implicit direct lighting has its place from a theoretical standpoint for extreme cases of indirect lighting and also for statistical studies of variance in specific contexts, but not if you're trying to shoot for render quality above all else. In another mutation, we considered adding a vertex v, so we now have Eye>A>v>B>Light. This is a much more likely one than the former, and assuming the BRDFs of these objects says it's okay, we'll most likely use this one. The thing that makes it cheap is that we can take our original path and remove a ray sample (the A>B sample) and insert two new ones (A>v and v>B). The sampling of Eye>A and B>Light are already done. Another possibility is that I could take A out of my path, and instead end up with an Eye>v>B>Light path. Nice thing about doing this is, all these previous calculations tracing back along that path from the light up to vertex v can now be used as a contribution to another pixel in the image.
Bear in mind that this removing and adding of vertices is just one of many mutation processes described in Veach's thesis. The extra case I mentioned of taking vertex A out of the path and using the altered path on a new pixel in the viewport is another one (referred to as lens subpath mutations). Among others are bidirectional mutations, which are good for further exploration of the sample space. Lens perturbations, caustic mutations, multichain mutations, etc. There are even strategies that effectively replace the whole path. Technically, if you limit the path segment replacement such that the very last vertex is always the very last vertex (and it is not occluded from the light, then this accomplishes the same goal as bidirectional mutations, and some people do implement this instead. This is why, even though it's a relatively old paper, there are still hardly any full implementations of MLT that achieves everything that the original paper by Veach portrayed. As of writing this, to this day, there are only 2 or 3 full implementations (and that's including Veach's own implementation), along with dozens and dozens of partial implementations where someone handled a few mutation strategies that cover a few key problems that MLT is meant to solve in less computational time. To date, though, nobody has really met the performance levels that Eric Veach achieved in his renderer, although many come close. Many people look at MLT and think it's a perfect solution for the performance issues, but the actual time you'll spend on renders won't be anything miraculous. Okay, you might spend a day to render an image that would otherwise take 2 days with MCPT, but you're not going to go all the way down to minutes and seconds.
Ah, yes... here we go coming down to everybody's favorite. The thing that made Henrik Wann Jensen a hero to us all  Photon Mapping. Yes, it's a damn fast method, and it's often believed that photon mapping is the endall beall catchall rendering algorithm and it does everything short of helping you lose weight. Photon mapping is a global illumination algorithm which is derived from the older particle tracing technique, which itself is just a further extension from forward raytracing. I should note that occasionally, people in various books get forward and backward raytracing mixed up simply because doing forward raytracing is not feasible in practice, so backwards raytracing is the only type we usually perform.
The idea behind photon mapping is to store where light hits surfaces as it bounces around through a scene and use density estimates to get a smooth reconstruction of lighting. Now a direct visualization of these density estimates will result in something with a lot of lowfrequency variance. And while it's easy to get rid of highfrequency noise with some postprocessing, it's very hard to get rid of lowerfrequency noise. Well, in order to deal with this, we actually get indirect illumination by going through a final gather pass and getting irradiances from many places in addition to the usual direct lighting. In Jensen's own implementation, this final gather is a distribution raytracing pass, but a pathtracing or MLT gather pass is perfectly feasible, and in fact, many 2pass GI algorithms use a final gather pass this way. Now everybody accepts that photon mapping is very fast... and the fast part happens to be bouncing the light around the scene. Unfortunately, we can't really say the same about the final gathering and the density estimation. If not for this, a 25minute photon map render would instead take about 25seconds.
In general, the density estimation is done by searching for the k nearest neighbouring photons from a point and seeing how large of a sphere can contain all these k photons. Well, normally we store these photons in a kdtree to make this knearestneighbours search easy. A kdtree is not that different from a BSP tree except that instead of using real planes from the world geometry, you use imaginary axisaligned planes, and the axis alignment of the plane you choose is a given as a function of the depth in the tree (normally, anyway, though this isn't a requirement). Well, to make sure that this search is also reasonably fast, we also have to balance the kdtree. This operation, as it so happens... does not run reasonably fast at all. That kdtree is storing hundreds of thousands if not millions of photon strike locations. And because any planar split is deterministically aligned with respect to its depth, we have to maintain that feature of the tree when balancing. So it's not quite as easy as balancing a BSP tree because we have to rotate three ways at a time. It proves to be incredibly slow in the end. While there are other options for the data structure, (e.g., loose octrees, BVTs), kdtrees are generically the most often chosen because that is what Jensen himself used. While I have often found in practice that normal octrees are infinitely easier to work with and it does save a reasonable amount of render time, it is also very prone to aliasing issues, and it also doesn't scale quite as well with photon map resolution  while you have the option of only subdividing down to x photons per node, this means you have an exhaustive search inside each node, and even if you allow the subdivision to go extremely deep, you run into precision and memory issues.
Then we come down to our final gathering, which is another slow part. Now certain things we can directly visualize  like caustics. This is partly because caustics are only visible on primarily diffuse surfaces. But even more so because caustics are such a highfrequency lighting phenomenon and the little bit of noise we might see is something to expect even in the real world. Also, it's very hard to get a good convergence rate on caustics with regular Monte Carlo sampling, and that's why photon mapping tends to separate caustics into a separate caustics photon map which is directly visualized with a raycast and density estimation. That also means that in order to get good covariance characters within a caustic, you need a LOT of photons to be mapped within small spaces to get some halfway decent idea of the overall form. It's perfectly normal even to have the caustics map carry more than twice the total number of photons as the global photon map.
Now one of the bad things about having this separation into a caustics map, a global map, a volume map, and so on is that you end up having to do that many more passes to support more features, which greatly takes away from the overall speed advantage of photon mapping in the first place. One of the really great things about it, though, is that it makes photon mapping itself very flexible in how you choose to use it. In fact, before Jensen published the idea of using photon mapping by itself as a global illumination estimator, he had used it to build importance maps for a scene when performing other Monte Carlo illumination algorithms. One of the things that is done in the global photon map is classification of the different types of photon strikes  that being, direct illumination, indirect illumination, and shadow photons. Direct illumination photons are generally the obvious ones where a photon's strike on a surface happens to be the first in a stochastically selected path from the light. Beyond that point, any photon strikes along this path would be indirect illumination photons. Shadow photons are not really true photon strikes per se, but a followup along a direction where a direct illumination photon strikes an object. These shadow photons hit the world at a "wouldbe" point if the object that the previous photon hit wasn't there. The reason we have these shadow photons is for a few reasons. One, the fact that we use density estimates to get irradiance estimates, having a lack of photons is not enough to say that we're in shadow and the end result would be an artificially soft and slightly overbrightened shadows. The other thing (and this helps us in the speed department), is the fact that generally, you will see portions of a scene that is either entirely in shadow or entirely illuminated, so having the shadow photons helps reduce the number of shadow ray samples we need in the final gather pass.
Well, going over with a diagram or two... We have a light and a scene. We trace thousands of rays from the lightsource and bounce them around throughout the scene. The blue rays would be our direct lighting rays. Each of these carries some amount of energy (luminance,color) with them. These will be stochastically bounced around within the scene generating indirect ray samples  the red rays in the diagram. Obviously, the red rays would have some different energy based on the color and BRDF of the surface it hits. Wherever a ray hits, we store info about position, incident direction, energy, surface normal, type, etc. Actually, if using a kdtree, the position can be determined through the tree traversal. Also, one of the optimizations involving irradiance caching means we can also store a local radiance estimate to speed things up, but that's another matter. Of course, our tracing of indirect rays applies the usual Monte Carlo sampling principles like BRDF/BSDF weighted sampling, Russian roulette, and so on.
The black dashed rays are the shadow rays. Where the blue direct lighting rays hit the sphere, we shot a shadow ray along that same direction until the ray hit some other object. Obviously, we don't create any shadow photon points for shadow rays that go outside the scene. The points labeled U and P represent shadow photons within the umbra and penumbra (or at least close to the penumbra, if not in it). The main idea is that around that point U, chances are that all the nearby photons are going to be other shadow photons, and knowing that, we can assume that if a ray sample in the final gathering phase hits the surface here at point U, we can guarantee that the point is in shadow and doesn't need any shadow samples or direct lighting samples. A reconstruction of the indirect lighting alone would be enough. Now point P, on the other hand is pretty near the edge of the sphere's shadow, which means that there is a potential for many of the nearby photons to include both shadow and direct and indirect lighitng photons. Basically, if the local area doesn't have enough shadow or illumination photons, or if the nearby photons includes a mixture of shadow and direct illumination photons, then only we actually take shadow ray samples during the final gather.
Well, that's the easy part... And also the very fast part. Shooting hundreds of thousands of rays through a scene is pretty fast. Granted, this doesn't include the caustics, although we could simulate very simplistic caustics without the aid of a specific caustics photon map. However, for the most part, it's safer to assume that there is some complexity and high frequency components that would warrant the use of a caustics map. There's a bit of setup work involved in that, though, where you would want to create a few lights that would direct all their photons to surfaces that could cause caustics. For more simplistic renderings of caustics, we can consider the approach of simply taking all photons resulting from LS+D paths directly into the caustics map. As a general rule, the global photon map contains hits along L(SD)*D paths not including the LS+D paths.
The slow parts of photon mapping tends to be the organizing and ordering of the kdtree and the final gathering phase. Technically, it's not 100% necessary to use a kdtree, but Jensen considers this ideal for the speed of nearest neighbour searches. There are cases, such as the limited GPU implementations of caustics photon mapping where a kdtree is simply out of the question. Also, there are studies going into realtime photon mapping using very rough radiance estimates with very few photons, where again kdtrees are a poor choice because the act of balancing the tree is incredibly slow. Then there's the final gathering phase, which, in order to deal with the sampling noise, uses distribution raytracing or monte carlo path tracing or some other raytracing scan using the local radiance estimates wherever secondary rays hit in order to compute the indirect lighting component. Typically, these two portions put together can consume some 25 orders of magnitude more time than the initial photon tracing through the scene (especially the final gather, which is resolution dependent).
Part of the reason photon mapping is so fast is because the initial phase is not resolution dependent. Unfortunately, this is also the thing that makes it very hard to scale up to large scene volumes. In cases where the size of a scene relative to its lights is large, it takes many photons to get a detailed result. While few photons and some artificial increase in apparent brightness is sufficient for lowfrequency lighting, it's not sufficient for parts of a scene which exhibit highfrequency lighting characteristics like caustics or reflective paths off of glossy surfaces. It also means that anything within a scene that inherently carries some high frequency surface details demands more photons to acquire good covariance characters. For small scenes, this is usually fine. For large scenes with few lights, it often doesn't scale well. Which is not to say that photon mapping isn't of use outside small scenes, but to say that photon mapping by itself is not a guaranteed universal win.
Where it serves its purpose exceptionally well is in filling an irradiance cache for further computation. An irradiance cache is, as the name suggests, a map over the scene that that caches irradiance information. The main purpose it serves is to save explicit lighting samples by reducing them to data sampling problems. While there are plenty of ways to fill in such a map, photon map sampling works out as one of the better methods because of the fact that it not only gives us direct lighting, but also indirect lighting, caustics, and shadow information. While an irradiance cache wasn't part of the original design of a photon mapper, it's now relatively a given for any photon map renderer as it's almost always a massive performance win.




This scene was rendered using ~820,000 photon strikes (200,000 photons shot) stored for the global photon map and ~560,000 stored (485,000 photons shot) in the caustics map. A high quality higher resolution (960x720) render is also available. Unlike the renders above which have no gamma correction, this image is gamma corrected for a gamma of 2.2. This render took 1:30:55 hours, approximately 2.25x as long as the lower resolution (512x384) renders in the table above, while being about 3.5x the resolution. This suggests that the biggest chunk of time is in the construction of the irradiance cache and/or the final gathering, both of which are resolution dependent portions of the renderer. Indeed having somewhat over a million points in the kdtree is probably mildly above average for a scene of this size given a max recursion depth of 5 for every photon's path. Notably, there are still errors on the far wall, and this is largely due to the scene setup which didn't account for reflective caustics, and so it shows up as smatterings of high photon density. Since the search range for the global map was for the nearest 350 photons, the fact that we see small dots on the wall suggest causticslevel density in that area, but not enough photons to cover the problem. In practice, problems like this would probably be postprocessed away and treated as trash, rather than trying to rework the scene to suit the particular problem. Only for academic demonstrations and studies do we really care about things like that  in most cases, the caustic on the floor provides enough information to make a relatively convincing render, and it is the point that inherently draws a viewer's attention.
In feature film cases, things like photon maps are often saved for specific subtasks (usually caustics rendering more than anything else). Among the earlier known cases was the first of the Final Fantasy movies : The Spirits Within (though I guess some would argue it's more fair to call it the first CG movie to carry the "Final Fantasy" name), which pretty much used only a caustics photon map. This has been changing little by little over the years. Where 5 or 6 years ago, film CG would almost never resort to any form of GI and even those that did might at most presample a single indirect lighting bounce and live with the precalculated result for several frames, more recent films will use limited GI solutions almost exclusively. Computing power has gotten cheaper, but we're still a ways off.
There has been a great deal of talk in recent years about a resurgence of raytracing and the realtime execution of GI approximations. A lot of groups, particularly the Saarland RPU group, have demonstrated hardware prototypes which show promising results. Especially given their limitations of being very low clock speed and very poor memory bandwidth, yet still performing on parity with midrange singlecore PC CPUs. In reality, we're probably a very long way away, though I admit this rather begrudgingly as I am a raytracing psychopath. In terms of raw computational power, a lot of existing GPUs are far enough along to have some impressive results  if the architectures were only suited to the task. Raytracing by nature is recursive and relies a great deal on sampling information from an entire scene, whereas GPUs are designed specifically to do the exact opposite  work on linear calculations with no selfdependencies worrying only about a single scene element at a time. Even more importantly, scene complexity is always climbing, and sampling data from the scene means that a raytracer requires random access within large datasets, and modern memory architectures are nowhere near good enough to handle this. This is generally the limiting factor when large renderfarms do CG work for films  the computational power is there, but there is massive latency and hideously poor bandwidth when moving data around. Even if a single PC acquires the computational resources to do the work, we will basically be memory accessbound rather than CPUbound, and unfortunately, memory evolves at an unacceptably pitiful rate.
In spite of everything, there is always a move to try and get global illumination effects to be feasible in realtime. It's still the holy grail of graphics, even if the holy grail itself is an incomplete item. Of course, it's silly to think we can actually do anything that ranks in the same league as any of the aforementioned techniques. So what do we do? We cheat.
The most classical and obvious of cheats that I'm sure everyone has seen is the constant ambient lighting component. The idea is simply to say that lighting bouncing around in the scene makes everything in the scene just a little brighter. And if you were to look at things like the progressive radiosity renders, it's probably the most obvious conclusion to draw. But as we all know, it's also very wrong, or rather, greatly oversimplified  it occasionally looks passable, or whatever "flatness" it ends up creating can be overcome by other forces. The problem is that, in reality these sorts of interreflections that we're trying to simulate are the result of the objects in a scene and where they are, what shape they are, what their reflectance characteristics are like, and so on. Among the nicer tricks out there include directionally dependent ambient. This is usually done with a lowresolution cubemap, but it's also just as well using 6 colors (which is more compact) for each axial direction. Probe points are placed in a scene which precalculate samplings and average them to get a rough estimate of light bouncing from every direction. The colors are blended using a surface normal, and that gives you a directionallydependent ambient lighting. Various games have used this technique to varying success (HalfLife 2 being a good example), and while it works nicely for dynamic objects moving about (e.g. characters) in a scene, it doesn't help much for the scene itself.
The classical approach for scene lighting is basic lightmapping. This is handy because higher quality lighting solutions can be applied and saved off in a texture. If someone chooses to apply radiosity or photon mapping or lord knows what on the scene, it can be saved off. Of course, among the obvious problems with lightmapping include the fact that it is, by nature, low detail, and it is secondly only applicable for static lighting and static scenery. While a good deal of the detail aspects can be dealt with (for instance, using 3 lightmaps for each of the tangent space directions and then blending between them based on the normal map normal), the static lighting part is simply the nature of the beast, even though there is nothing stopping you from additionally including dynamic lighting on top of lightmaps. However, with technology constantly advancing, more games tend to show that entirely dynamic lighting is technically feasible for a lot of cases, and so the public at large tends to expect it of all titles (or at least enough of it to fool them that the scene is dynamically lit). You might notice, though, that you can't lightmap a character or some other dynamic object. Well, nonetheless it works out nicely because the environment covers the majority of the screen, which means that most of what the user looks at looks very nicely lit (though not completely so).
Winning a great deal of interest these days tends to be the use of spherical harmonics bases. Spherical harmonics are little more than a Fourier series projected into spherical coordinates. Like the Fourier transform you're hopefully familiar with, a sum of harmonics down the frequency range captures progressively higher and higher frequency lighting components. While such a transform can get pretty massive, the parts we're usually interested in are the lowfrequency components. The "constant ambient" term we're familiar with is effectively equivalent to the constant term in the Fourier series here. In actual implementation, spherical harmonics coefficients are captured by convolving the basis functions with the lighting results. In the realtime application, lighting information is transformed into the same basis space. This is what makes it very convenient for realtime application in that the preprocess can apply whatever offline solution as with lightmaps, but end up with pervertex or perpixel evaluation of dynamic lighting given that lighting model (or at least a few frequency bands of it). The most common choice is to use the first 3 orders of the basis, which comes out to 9 coefficients to store either per vertex or in a series of textures. Of course, as you might expect, this is byh nature, low frequency, which means low detail, and because it depends on a prior offline preprocess, it is limited to static geometry. Though it can be used individually for geometry that moves through the scene, it still demands that the object is rigid geometry (i.e. must not deform in any way), and the results are also less complete by nature than a scene result because you cannot simulate the full solution for that object in the scene at any point. Of course, spherical harmonics are only of the precomputed radiance transfer solutions in exists. There is recent research on wavelets in the PRT domain as wavelets are generally able to cover more of the frequency range with fewer coefficients. However, PRT in general still has the same limitations of anything else that may be precalculated.
Other approaches in existence involve the idea of sampling PRT information at grid points in space in a scene and interpolating between the coefficients as an object moves about. This is the socalled "irradiance volume" approach where you are essentially filling what is most likely spherical harmonics samplings for various points in the volume occupied rather than specific objects. This is actually quite effective in practice, but speed is still an issue when also being combined with quality. Also, it's most particularly useful strictly for the indirect lighting components. Using it for shadows and direct lighting often doesn't work out very well in practice, but that is still acceptable. It remains to be seen though, how well it integrates with all types of scenery, though. While such a technique fits quite easily with indoor lighting problems, it is more difficult to effectively author an ideal setup for outdoor lighting, particularly because outdoor scenery is, by nature, open volumetric space. And this is the sort of problem that comes up even if someone were to come up with some ideal solution that is also very esoteric. It will ultimately not prove useful if it requires too much of content creators.
In general, though, whatever we choose to do to try and simulate diffuse interreflections globally is a good thing in the end results. Direct illumination models alone cannot produce images of convincing enough illumination characteristics to be of any use as a method towards photorealism. It's been used to evoke certain imagery or serve alternative functions (e.g. hard shadows in Splinter Cell provided hiding space). In the long run, there may be some point at which the raw computational power necessary to explicitly compute a lot of these quantities is available, but even by that point, it is more than 100% unlikely that memory architectures will have evolved far enough. An often missing point in the talk of "realtime global illumination" is that the amount of data to be dealing with and the random access patterns is not a trivial hurdle to overcome. I am often left wondering whether it will ever happen. Even as architectures evolve, so too will the content, and it's entirely possible that memory will never catch up.
So I guess this is a very long winded way of saying that Global Illumination is about illumination from all sources, emitted, reflected, scattered, etc. So if I wanted to be really concise about it, I'd just say Global Illumination is direct lighting + indirect lighting, but then you wouldn't learn anything, would you? More importantly, "indirect" demands some definition to those uninitiated in the ways of the pixel. As I've said before, I'm personally a fan of Monte Carlo Path Tracing, not so much for performance reasons, but because it's so utterly simple. It's not without its flaws, of course, which is why you may differ in your opinion. Whatever it is, at least this document is just one source of information by which you can form an opinion (assuming you don't already have some knowledge on the matter to begin with).
GI is not something to fear or revere, really. It's just another example of us trying to work out how to play digital god.