100042 ms

100042 ms... my raytracer took 100042 ms to run. I thought it must have been a bug. There must be some infinite for loop running somewhere. Nope. Just slow code. There has to be a better way. And my journey begins:

Step 1: Precomputed Values

In the initial assessment of my code, I observed multiple instances where the same calculations were being performed repeatedly. These instances were perfect candidates for optimization through precomputation. I hoped that by storing the results of these computations, I could significantly reduce the total amount of time spent calculating the same values over and over again.

How: Precomputing values involves calculating necessary computations ahead of time and storing the results in memory. These precalculated values are then available for instant access whenever they are needed, bypassing the need for redundant calculations. This optimization operates under the space-time tradeoff principle in computer science, opting to use more memory to store the computed values in order to save time during execution.
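My raytracer's actual intersection code isn't shown in this post, so here is a minimal sketch of the idea with hypothetical names (Vec3, Sphere, hits): the sphere caches its squared radius once at construction, and every ray test reuses it instead of recomputing r * r (and avoids a sqrt entirely by comparing squared distances).

```cpp
// Hypothetical sketch of precomputation in a ray-sphere test. The sphere
// caches radius * radius once, so every intersection test skips that
// multiplication and never needs a sqrt (it compares squared distances).
struct Vec3 { double x, y, z; };

double dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct Sphere {
    Vec3 center;
    double radius;
    double radius_sq;  // precomputed once at construction
    Sphere(Vec3 c, double r) : center(c), radius(r), radius_sq(r * r) {}
};

// Assumes dir is normalized; projects the center onto the ray and compares
// the squared perpendicular distance against the cached squared radius.
bool hits(const Sphere& s, const Vec3& origin, const Vec3& dir) {
    Vec3 oc{s.center.x - origin.x, s.center.y - origin.y, s.center.z - origin.z};
    double t = dot(oc, dir);            // distance along the ray to closest approach
    double d_sq = dot(oc, oc) - t * t;  // squared distance from center to the ray
    return t >= 0.0 && d_sq <= s.radius_sq;
}
```

One cached multiply looks trivial, but a raytracer runs this test millions of times per frame, which is how small precomputations add up.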

Why: Reading a stored value, particularly one that stays hot in cache, is usually much faster than redoing a complex computation such as a square root or a trigonometric call. So, by swapping computation for memory access, the program gets faster. The size of the win depends on modern memory hierarchies: a cache hit costs a few cycles, while the computation it replaces can cost far more.

Impact: This resulted in a very slight performance boost, reducing runtime from 100042 ms to 98464 ms. It was a small victory, but an important one as it laid the groundwork for a mindset of efficiency and optimization.

Step 2: Multithreading

Having made some initial optimizations, I realized that to achieve my target execution time, more drastic changes were needed. Noticing the possibility for concurrent operations in my code, I decided to implement multithreading. The goal here was to better utilize the multiple cores of modern CPUs for parallel execution and thus drastically cut down execution time.

How: Multithreading splits the tasks of a program across multiple concurrent threads of execution. Modern CPUs contain multiple cores, each capable of running a separate thread. By leveraging this feature, I was able to run several tasks concurrently, greatly speeding up total execution time.
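My renderer's internals aren't shown here, so this is a sketch under assumed names (render_rows, render_parallel, and a flat int image standing in for real pixels): each thread shades a contiguous band of image rows, and since bands never overlap, no locking is needed during rendering.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Sketch of row-banded multithreading. Each worker thread fills a
// contiguous, non-overlapping band of rows, so the threads never
// touch the same pixels and need no synchronization while rendering.
void render_rows(std::vector<int>& image, int width, int row_begin, int row_end) {
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < width; ++x)
            image[y * width + x] = x + y;  // stand-in for the real per-pixel shading
}

void render_parallel(std::vector<int>& image, int width, int height) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    int rows_per = (height + static_cast<int>(n) - 1) / static_cast<int>(n);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        int begin = static_cast<int>(i) * rows_per;
        int end = std::min(height, begin + rows_per);
        if (begin >= end) break;  // more threads than row bands
        workers.emplace_back(render_rows, std::ref(image), width, begin, end);
    }
    for (auto& t : workers) t.join();  // wait for every band to finish
}
```

Splitting by rows (rather than by individual pixels) keeps each thread's writes contiguous in memory, which is friendlier to the cache.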

Why: Multithreading can greatly enhance the performance of a program by enabling it to do more work concurrently. It doesn't change the asymptotic complexity, but with 'n' tasks spread across 'm' threads, the wall-clock time can drop by up to a factor of 'm'. However, it's crucial to manage these threads appropriately to avoid concurrency problems like race conditions.

Impact: After multithreading, my timing code reported 108042 ms, but the actual wall-clock time was around 28000 ms. The discrepancy arose because each thread was timed with its own clock, so the report effectively summed per-thread time instead of measuring the real elapsed time.

Step 3: Multithreading Clock Synchronization and Miscellaneous Optimizations

Seeing the discrepancy between real and reported time, I decided to address this issue by rewriting the time measurement function to use a synchronized clock. In addition, I started looking for potential performance bottlenecks and inefficient operations within the code.

How: I rewrote the function to use std::chrono::steady_clock, a monotonic clock shared by all threads, to measure wall-clock time once across the whole render. Then, I went through the code, identifying and improving expensive operations, reducing function calls, and optimizing the algorithmic complexity.

Why: Correct time measurement is crucial to accurately gauge the performance improvement from multithreading. By using a common clock, we ensure that all threads follow the same timeline. Further, small inefficiencies can add up over time, especially in iterative operations or recursive calls, and reducing these inefficiencies can lead to noticeable improvements in performance.

Impact: After these changes, the runtime was further reduced to 26217 ms.

Step 4: Thread Pool and Precompiled Headers (PCH)

Still aiming for my goal of 5000 ms, I decided to implement a thread pool to eliminate the overhead of creating and destroying threads for each new task. I also decided to use precompiled headers (PCH) to reduce compile time.

How: A thread pool is a group of pre-instantiated, idle threads that sit waiting to execute tasks from a shared queue. PCHs are a feature of most C++ compilers that allows certain parts of the code (like library headers) to be compiled once, with the result reused in subsequent compilations.
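My pool's actual implementation isn't shown in this post; this is a minimal sketch of the classic mutex + condition-variable design (the class and member names are my own). Workers are created once, then repeatedly pull tasks from a shared queue, so per-task thread creation cost disappears.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread pool sketch: worker threads are spawned once and loop,
// pulling tasks off a shared queue, instead of one thread per task.
class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();  // workers drain the queue, then exit
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();  // wake one idle worker
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // execute outside the lock so other workers can proceed
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

For a raytracer, each submitted task would be one row band or tile, so the pool replaces the per-frame thread spawning from Step 2.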

Why: Thread pools reduce the overhead associated with creating and destroying threads, thereby improving efficiency. PCHs reduce compile time by avoiding redundant compilations of unchanged parts of the code.

Impact: However, even with these enhancements, the runtime was still around 26000 ms. So...

Step 5: Memory Pooling and Spatial Partitioning Data Structures

This is where I thought it would end... that I would fail and never hit my goal. Then I stepped back and looked at the specific assumptions and constraints of the raycasting system itself. Digging into those constraints, I realized the solution lay in optimizing memory management through memory pooling, and in spatial partitioning data structures like Octrees and AABBs.

How: Memory pooling involves allocating a "pool" of memory blocks when the program starts. These blocks can then be handed out and reclaimed as necessary, instead of going through the general-purpose allocator each time. Octrees recursively partition 3D space into eight octants, while AABBs (axis-aligned bounding boxes) offer a cheap way to rule out potential intersections in 3D space.
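My Octree code is too long to show here, but the AABB test at its heart is compact. Below is a sketch of the standard "slab" method (the struct and parameter names are my own). Note that it takes the ray direction's precomputed reciprocals, tying back to Step 1.

```cpp
#include <algorithm>

// Sketch of an axis-aligned bounding box (AABB) ray test, the "slab"
// method. Cheap rejections like this let an Octree discard whole regions
// of the scene before any per-object intersection math runs.
struct AABB { double min[3], max[3]; };

// o is the ray origin; inv_d holds the precomputed 1/direction per axis
// (another precomputed value, in the spirit of Step 1).
bool intersects(const AABB& b, const double o[3], const double inv_d[3]) {
    double tmin = 0.0, tmax = 1e30;
    for (int axis = 0; axis < 3; ++axis) {
        double t1 = (b.min[axis] - o[axis]) * inv_d[axis];
        double t2 = (b.max[axis] - o[axis]) * inv_d[axis];
        tmin = std::max(tmin, std::min(t1, t2));  // latest entry across slabs
        tmax = std::min(tmax, std::max(t1, t2));  // earliest exit across slabs
    }
    return tmin <= tmax;  // the ray is inside all three slabs at once
}
```

Six multiplies, no branches per axis, and no divisions in the hot loop: exactly the kind of test you want guarding the expensive sphere and triangle intersections.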

Why: Proper memory management is crucial to the performance of a program. By utilizing memory pooling, overheads related to memory allocation and deallocation can be minimized. Efficient data structures like Octrees and AABBs can reduce the complexity of operations, leading to a faster and more efficient program.

Impact: After these implementations, performance improved dramatically: the runtime dropped to roughly 2000 ms, a 98% reduction from the original 100042 ms, successfully beating my target.

100042 ms to 2000 ms. Through this project I found my love of optimization. I find it so interesting taking a peek behind the curtain, and then taking a peek behind that curtain, and then taking a peek behind ... you get the picture.
