RISC vs. CISC (reduced vs. complex instruction set computer) simple, single clock instructions emphasis on software complexity general purpose registers vs. complex, multi clock instructions emphasis on hardware complexity special purpose registers MIPS Instruction formats R [opcode | rs | rt | rd | shamt | funct] [6|5|5|5|5|6] I [opcode | rs | rt | immediate ] [6|5|5|16] J [opcode | address ] [6|26] Register conventions (usually described in questions) $s[0-7] = saved register, kind of like a local variable $t[0-9] = temporary register, intermediate variables $a[0-3] = argument registers $v[0-1] = result registers $zero $at = assembler temporary, used for expanding pseudo instructions $sp = stack pointer $ra = return address (s registers are saved across function calls) Important Instructions Arithmetic Logic add[i]?[u]? sub[u]? and[i] or[i]? sll, slr lui - load upper immediate, important. slt[i]?[u]? Branches beq, bne j, jal, jr Load/Stores lb, lbu, lw sb, sw Pseudo-instructions... exist. Basically, they are macros that expand to multiple instructions using $at to simplify programming a little bit. What's with the unsigned versions ([u]?)? Do or do not sign extend (lb,lbu) Do or do not trigger overflow flags ( add, sub, mult, div) Do or do not use signed comparators (slt) Pipelining 5 stage RISC CPU Fetch Decode Execute Memory Writeback Main parts are instruction fetch unit, register file, alu, memory unit, whole bunch of muxes Latency vs Throughput Time to complete a single task, tasks completed per unit time Hazards structural - HW cannot support some combination of instructions control - pipelining of branches makes instruction fetch wrong data - need result of an instruction still in the pipeline Forwarding if data is available but hasn't made it through the pipeline -> it can be "forwarded" to the appropriate piece of hardware when necessary Delay Slots branch and load delay slots in the semantics of the language always execute the instruction after a branch or load (because forwarding cannot solve all dependencies) Memory Hierarchy Pyramid going from registers, L1 cache, etc. all the way to network disk/tape Higher layers are faster, smaller, more expensive Lower layers are slower, bigger, cheapers Cache Locality Spatial and Temporal Geometry Data Size = volume Block size = width Associativity = depth (Direct Mapped, N-way Set Associative, Fully Associative) Virtual Address = [Tag|Set Index|Offset] DataSize = Associativity*2^(I + O) BlockSize = 2^O T = WordSize - I - O Write Through vs. Write Back Replacement Policy (only matters within a single set) LRU - least recently used, boned if your accessing stuff slightly bigger than your array Random - boned if you are reusing certain parts of memory more often L1 Cache physically addressed (this varies, but the standard one from book follows this) Virtual Memory Separate the programs address space from the machine address space Allows programs to share Allows programs separation Allows larger virtual address spaces than you have physical memory (With paging, allows you to keep only a working set of each programs address space) Base and Bounds Simple method where you simply add a fixed offset to virtual addresses sucks at expanding allocation and sharing between processes Paging as just another level of caching Vaddr = [VPN | PageOffset] Use VPN to index into OS maintained PageTable (various organizations of these, one per process usually) Page Table Entry = [Resident | Dirty | Referenced | Permissions | PPN] Paddr = [PPN | PageOffset] TLB (translation lookaside buffer) Cache of the page table, must be flushed on context switches vaddr = [TLB tag | TLB index | page offset] TLB entry = [tag|ppn|dirty|valid|referenced|access rights] Performance and Optimization AMAT = average memory access time = hit rate*hit time + miss rate * miss time CPI = sum of cycles per instruction type (load/store, arithmetic, etc.) weighted by frequency Loop unrolling - makes code bigger, less overhead and branching ----- Other fun stuff, has not been on recent tests ----- Number Representation Unsigned - Binary One's Complement - Leading 1 means negative, invert all bits to get magnitude Two's Complement - Leading 1 means negative, invert all bits and add 1 to get magnitude Floating Point - [sign | exponent | fraction] bias = 2^(E-1) - 1 = 011...1 Normalized: (-1)^sign (1 + fraction)x2^(exponent - bias) Denormalized: (-1)^sign (0 + fraction)x2^(1-bias) max exponent (11...1) => infinity if fraction = 0, NaN (not a number) otherwise min exponent (00...0) => denorm I/O Memory mapped I/O Certain memory locations are "special" Basically, you have a set of two "mailboxes" for each direction Control Box is a ready + interrupt enable bits -I: ready means data is ready to be read by CPU -O: ready means data is ok to be written by CPU Data Box is just a single byte of transfer DMA is more realistic addr stack ff..ff | v ^ | heap code data 0 Heap Management Free list managed by Best, first, next fit. For requests lower than a certain size, use Slab Allocator fixed version of buddy allocator Buddy Allocator