Difference between revisions of "ECE-QE-CE-4-2015" - Rhea

Revision as of 21:23, 15 April 2017

ECE Ph.D. Qualifying Exam
Computer Engineering, Question 04 (CE-4): Architecture
August 2015

Please note: this page is still being currently updated and modified as of April 15, 2017.

I. Processor

In modern out-of-order processors pipelines, there are many "pipeline loops" due to data/information generated in later pipeline stages and used in earlier stages (e.g., register bypass, branch resolution, and rename table update). Modern pipelines have many such loops each of which is relevant only to a subset of instructions.

Assume single-issue pipeline for the following questions.

1. What is the increase in the cycles per instruction (CPI), if 80% of instructions incur a loop from the 8th stage to the 8th stage? Also, give a real example of such a loop.

Assuming each stage can be completed in one cycle: If 80% of instructions incur a one cycle/stage loop-back, this implies that stage would have to be performed again. Thus, a one cycle penalty for 80% instructions would result in a 1 * 0.8 = 0.8 CPI increase.

2. For a loop from the 15th stage back to the 1st stage, the pipeline employs prediction with accuracy 90% (you should figure out how prediction lessens the loops' impact on the CPI). What is the increase in the CPI if the loop is relevant to 20% of all instructions?

For 20% of instructions, there is a pipeline loop from the 15th to the 1st stage.
For 90% of such instructions, the pipeline can predict this behavior.

So 18% (0.2 * 0.9) of instructions loop from the 15th stage to the 1st stage in a predicted manner. This means that 2% (0.2 * 0.1) of instructions loop from the 15th stage to the 1st stage without prediction and must pay a penalty. This penalty is where the increase in CPI will originate. When a misprediction occurs, the penalty here is 15 stages/cycles which 2% of instructions must pay.

Therefore, 0.02 * 15 cycles/stages = 0.3 increase in CPI. If the correctly predicted loop back from the 15th to the 1st stage does not have to pay the penalty (I assume this here due to the nature of the correct prediction), then 0.3 is the total increase to the CPI.

3. For a loop from the 15th stage back to the 1st stage and another from the 10th stage back to the 5th stage, if 1% of instructions are affected by the former loop and 3% (including the previous 1%) are affected by the latter loop, what is the increase in CPI?

From the question we know that a loop back from the 15th to the 1st stage equates to a 15 cycle/stage penalty. A loop then from the 10th stage back to the 5th stage is a cycle/stage penalty (as stages 5, 6, 7, 8, 9, and 10 all have to be visited again).

Then we know that 1% of instructions pay the 15 stage/cycle penalty and 3% of instructions pay the 6 stage/cycle penalty.

The problem states that this 3% "includes the previous 1%." Given this, to calculate the affect of this loop separately, I assume the 3% is instead 2% (3%-1%).

Then:

1% pay the 15 stage penalty: 0.01 * 15 = 0.15
2% pay the 6 stage penalty: 0.02 * 6 = 0.12

Therefore, the total increase to the system CPI imposed by these two loop-backs is 0.27 CPI.

II. Memory systems

1. Assume 25% of all instructions are loads and stores, L1 miss rate is 20%, L2 miss rate is 20% (out of L1 misses), and L1, L2, main memory latencies are 3, 32, and 300 cycles, respectively, and their bandwidths are, respectively, 1 access (hit or miss) every 1, 8, and 30 cycles. What is the achievable IPC (instructions per cycle), if we assume abundant parallelism due to simultaneous multithreading?

2. Inverted page tables reduce page table space overhead via hashing. However, how are hashing collisions handled? Explain any policy the hardware/operating system may have to use.

III. Multicore

1. Assuming non-atomic writes, add statements to the following code to show sequential consistency (SC) violation. You may add statements to one or both threads, and either before or after the statements shown below.

Assume that before the following code segment runs X=Y=0.

Thread 1: write X = 10;
Thread 2: write Y = 20;

Show your new code and explain how SC may be violated.

2. Assume the following multithreaded code where mypid is each thread's process id. The thread runs on n cores with coherent private caches.



repeat {many times}

for i = 1 to n

A[mypid] = A[mypid] % mypid;

(a) Explain why this code is likely to perform poorly.

(b) Describe how to address this problem and show your pseudocode.

IV. Fundamentals

A high-performance architecture uses ultra-wide simultaneous multi-threaded (SMT) processors (e.g., 128-way SMT processors) to target "embarrassingly" parallel, highly regular and loop-based codes with heavy branching (e.g., highly-conditional dense matrix applications). Due to the heavy branching, vector compiler scheduling is not effective. Instead, the compiler generates simple, parallel threads from the loop iterations each of wich runs on an SMT context (e.g., 128 iterations run in parallel). While the branching disallows lock-step execution across the threads, the iterations are naturally load-balanced and access mostly sequentially-contiguous data (e.g., thread i accesses ith array elements). A key challenge is designing memory system to support such processors which employ a cache hierarchy (memory system includes cache and main memory). Assume the code has enough reuse to justify a conventional cache hierarchy.

1. Which memory-system parameters poses a key challenge for data access? Which parameters are much less relevant?

2. Based on the code characteristics listed above, what would be your strategy to address the above challenge? Describe your strategy for both caches and main memory.

3. What difficulty does the lack of lock-stepping cause your strategy? How would you tackle the difficulty?

@@ Line 11: / Line 11: @@
 <hr>
-<big>I. Processor</big>
+==<big>I. Processor</big>==
 In modern out-of-order processors pipelines, there are many "pipeline loops" due to data/information generated in later pipeline stages and used in earlier stages (e.g., register bypass, branch resolution, and rename table update). Modern pipelines have many such loops each of which is relevant only to a subset of instructions.
@@ Line 46: / Line 46: @@
 <hr>
-<big>II. Memory systems</big>
+==<big>II. Memory systems</big>==
 . Assume 25% of all instructions are loads and stores, L1 miss rate is 20%, L2 miss rate is 20% (out of L1 misses), and L1, L2, main memory latencies are 3, 32, and 300 cycles, respectively, and their bandwidths are, respectively, 1 access (hit or miss) every 1, 8, and 30 cycles. What is the achievable IPC (instructions per cycle), if we assume abundant parallelism due to simultaneous multithreading?
@@ Line 54: / Line 54: @@
 <hr>
-<big>III. Multicore</big>
+==<big>III. Multicore</big>==
 . Assuming non-atomic writes, add statements to the following code to show sequential consistency (SC) violation. You may add statements to one or both threads, and either before or after the statements shown below.
@@ Line 78: / Line 78: @@
 <hr>
-<big>IV. Fundamentals</big>
+==<big>IV. Fundamentals</big>==
 A high-performance architecture uses ultra-wide simultaneous multi-threaded (SMT) processors (e.g., 128-way SMT processors) to target "embarrassingly" parallel, highly regular and loop-based codes with heavy branching (e.g., highly-conditional dense matrix applications). Due to the heavy branching, vector compiler scheduling is not effective. Instead, the compiler generates simple, parallel threads from the loop iterations each of wich runs on an SMT context (e.g., 128 iterations run in parallel). While the branching disallows lock-step execution across the threads, the iterations are naturally load-balanced and access mostly sequentially-contiguous data (e.g., thread ''i'' accesses ''ith'' array elements). A key challenge is designing memory system to support such processors which employ a cache hierarchy (memory system includes cache and main memory). Assume the code has enough reuse to justify a conventional cache hierarchy.