# Chapter 4 ## Flynn's Taxonomy (parallel operation types) * Single Instruction Single Data (SISD) * ![SISD Diagram](sisd_diagram.png) * Multiple Instruction Multiple Data (MIMD) * ![MIMD Diagram](mimd_diagram.png) * Single Instruction Multiple Data (SIMD) * ![SIMD Diagram](simd_diagram.png) * Multiple Instruction Single Data (MISD) * ![MISD Diagram](misd_diagram.png) ## Pipelining * Instruction level parallelism (ILP) * Non-pipelined CPU execution: * ![Non-pipelined CPU](no_pipeline.png) * Non-pipelined CPU execution: * ![Pipelined CPU](pipeline.png) ## Branch Prediction in games consoles * PS3 Cell prcessor had poor branch prediction * PS4 and Xbox ONE AMD Jaguar has advanced branch prediction hardware ## Fast branch-avoidant code for safe floating point division ### (handle divide-by-zero) ``` int SafeFloatDivide_pred(float a, float b, float d) { // convert Boolean (b != 0.0f) into either 1U or 0U const unsigned condition = (unsigned)(b != 0.0f); // convert 1U -> 0xFFFFFFFFU // convert 0U -> 0x00000000U const unsigned mask = 0U - condition; // calculate quotient (will be QNaN if b == 0.0f) const float q = a / b; // select quotient when mask is all ones, or default // value d when mask is all zeros (NOTE: this won't // work as written -- you'd need to use a union to // interpret the floats as unsigned for masking) const float result = (q & mask) | (d & ~mask); return result; } ``` This technique for selecting one of two possible values is called predication or a select operation ## Very Long Instruction Word (VLIW) CPU Design * The CPU is made up of multiple of the same compute element (ALU, FPU, etc.) * Each "instruction" is comprised of 2 or more instructions, one for each compute element * Example of this is PS2: The vector units VU0 and VU1 can dispatch two instructions per cycle each ## Multi Core CPUs * PS4 and Xbox One both contain multicore CPUs * Each has an accelerated processing unit (APU) * two quad-core AMD Jaguars in a single die with a GPU, memory controller, and video codec * Of the 8 total cores, 7 are available to game applications * PS4 Architecture: ![PS4 CPU arch](ps4_arch.png) * XBox One Architecture: ![Xbox one CPU arch](xboxone_arch.png) ## Symmetric vs Assymetric Parallelism * Symmetric CPUs - all cores are copies of the same hardware; threads run the same on each core; threads can have core affinities. PS4/Xbox One are examples of this, with 8 equal cores * Asymmetric CPUs - cores can be different and are treated differently by the OS. A master core runs the OS and other cores are slaves. Example of this is PS3's Cell Broadband Engine (PPU,SPUs) ## Console Shared Libraries * The PS4 OS uses shared libraries called PRX (name from PS3 - PPU Relocatable Executable) ## Fibers * Provide cooperative multitasking - essentially threads except the programmer has explicit control over when they are scheduled compared to other Fibers - for when you want to write your own job system for your program ## Mutexes vs Semaphores * A mutex has a binary state: locked or unlocked * A sempahore is a counter with an integer value of N, allowing a max of N threads to use it at once; when the semaphore value is 0 any requesting threads are blocked. A nonzero value means the semaphore is "signaled" (true/allow) and a zero value is "nonsignaled" (block/forbid) * A mutex can only be unlocked by the thread that locked it; * A semaphore can be be decremented/incremented by different threads; thus is more commonly used as a signal * Semaphore example: ``` Queue g_queue; sem_t g_semUsed; // initialized to 0 sem_t g_semFree; // initialized to 1 void* ProducerThreadSem(void*) { // keep on producing forever... while (true) { // produce an item (can be done non- // atomically because it's local data) Item item = ProduceItem(); // decrement the free count // (wait until there's room) sem_wait(&g_semFree); AddItemToQueue(&g_queue, item); // increment the used count // (notify consumer that there's data) sem_post(&g_semUsed); } return nullptr; } void* ConsumerThreadSem(void*) { // keep on consuming forever... while (true) { // decrement the used count // (wait for the data to be ready) sem_wait(&g_semUsed); Item item = RemoveItemFromQueue(&g_queue); // increment the free count // (notify producer that there's room) sem_post(&g_semFree); // consume the item (can be done non- // atomically because it's local data) ConsumeItem(item); } return nullptr; } ``` ## C++ Atomic * Protects variables without the need for a mutex via CPU specific instructions (normally) ``` std::atomic g_data; std::atomic_flag g_ready = false; void ProducerThread() { // produce some data g_data = 42; // inform the consumer g_ready = true; } void ConsumerThread() { // wait for the data to be ready while (!g_ready) PAUSE(); // consume the data ASSERT(g_data == 42); } ``` ## Testing if a lock is needed * If it is known beforehand, due to code architecture, that a race condition probably won't happen, you can use a volatile boolean (so the compiler doesnt optimize it out) to assert that the overlapping race condition never happens. This code can then be defined out for production builds * Has been used and been successful at Naughy Dog ``` class UnnecessaryLock { volatile bool m_locked; public: void Acquire() { // assert no one already has the lock assert(!m_locked); // now lock (so we can detect overlapping // critical operations if they happen) m_locked = true; } void Release() { // assert correct usage (that Release() // is only called after Acquire()) assert(m_locked); // unlock m_locked = false; } }; #if ASSERTIONS_ENABLED #define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L) (L).Acquire() #define END_ASSERT_LOCK_NOT_NECESSARY(L) (L).Release() #else #define BEGIN_ASSERT_LOCK_NOT_NECESSARY(L) #define END_ASSERT_LOCK_NOT_NECESSARY(L) #endif // Example usage... UnnecessaryLock g_lock; void EveryCriticalOperation() { BEGIN_ASSERT_LOCK_NOT_NECESSARY(g_lock); printf("perform critical op...\n"); END_ASSERT_LOCK_NOT_NECESSARY(g_lock); } ```