Main Memory & DRAM

Spring 2018 :: CSE 502
Main Memory & DRAM
Nima Honarmand

Main Memory: Big Picture
1) Last-level cache (LLC) sends its memory requests to a Memory Controller
   - Over a system bus or other types of interconnect
2) Memory controller translates this request to a sequence of commands and sends them to DRAM devices
3) DRAM devices perform the operation (read or write) and return the results (if read) to memory controller
4) Memory controller returns the results to LLC
[Figure: LLC, System Bus, Memory Controller, Memory Bus]

SRAM vs. DRAM
- SRAM = Static RAM
  - As long as power is present, data is retained
- DRAM = Dynamic RAM
  - If you don’t refresh, you lose the data even with power connected
- SRAM: 6T per bit
  - Built with normal high-speed VLSI technology
- DRAM: 1T per bit + 1 capacitor
  - Built with special VLSI process optimized for density

Memory Cell Structures
[Figure: an SRAM cell (wordline, bitlines) next to two DRAM cells (wordline, bitline b): trench capacitor (less common) and stacked capacitor (more common)]

DRAM Cell Array
[Figure: cell array with row address decoder, sense amps, row buffer, and column address multiplexor]
- DRAM is much denser than SRAM

DRAM Array Operation
- Low-level organization is very similar to SRAM
- Reads are destructive: contents are erased by reading
- Row buffer holds read data
  - Data in row buffer is called a DRAM row
    - Often called a “page”; do not confuse with a virtual memory page
  - Read gets entire row into the buffer
  - Block reads are always performed out of the row buffer
    - Reading a whole row, but accessing one block
    - Similar to reading a cache line, but accessing one word

Destructive Read
[Figure: bitline voltage and capacitor voltage (relative to Vdd) for stored 1 and 0, with "Wordline Enabled" and "Sense Amp Enabled" marked, plus the sense amp output]
- After read of 0 or 1, cell contents are close to ½ Vdd

DRAM Read
- After a read, the contents of the DRAM cell are gone
  - But still “safe” in the row buffer
- Write bits back before doing another read
- Reading into buffer is slow, but reading buffer is fast
  - Try reading multiple lines from buffer (row-buffer hit)
- This process is called opening or closing a row
[Figure: DRAM cells, sense amps, row buffer]
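The row-buffer behavior described above can be sketched as a tiny model. This is only an illustration, not how a real device is implemented: the `Bank` class and the access trace are hypothetical, and timing is ignored entirely.

```python
class Bank:
    """One DRAM bank with a single row buffer (hypothetical toy model)."""

    def __init__(self):
        self.open_row = None  # no row is open initially

    def access(self, row):
        """Return 'hit' if `row` is already in the row buffer, else 'miss'."""
        if self.open_row == row:
            return "hit"
        # Miss: the old row is closed (written back) and the new row is opened.
        self.open_row = row
        return "miss"


bank = Bank()
trace = [3, 3, 3, 7, 7, 3]  # hypothetical sequence of row numbers
results = [bank.access(row) for row in trace]
print(results)
```

Consecutive accesses to the same row hit in the buffer; every change of row pays for a slow array read.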

DRAM Refresh (1)
- Gradually, a DRAM cell loses its contents
  - Even if it’s not accessed
  - This is why it’s called “dynamic”
- DRAM must be regularly read and re-written
  - What to do if no read/write to row for long time?
- Must periodically refresh all contents
[Figure: capacitor voltage for stored 0 and 1 (relative to Vdd) over a long time]

DRAM Refresh (2)
- Burst refresh
  - Stop the world, refresh all memory
- Distributed refresh
  - Space out refresh one (or a few) row(s) at a time
  - Avoids blocking memory for a long time
- Self-refresh (low-power mode)
  - Tell DRAM to refresh itself
  - Turn off memory controller
  - Takes some time to exit self-refresh
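To get a feel for the trade-off between burst and distributed refresh, here is a rough back-of-the-envelope sketch. The retention period (64 ms), row count (8192), and per-row refresh time (50 ns) are illustrative assumptions, not values from the slides.

```python
def distributed_refresh_interval_us(retention_ms, rows):
    """Gap between single-row refreshes so each row is refreshed once per retention period."""
    return retention_ms * 1000.0 / rows


def burst_refresh_block_us(rows, row_refresh_ns):
    """Time memory is unavailable if all rows are refreshed back to back."""
    return rows * row_refresh_ns / 1000.0


# Assumed example parameters (hypothetical, for illustration only).
interval = distributed_refresh_interval_us(retention_ms=64, rows=8192)
blocked = burst_refresh_block_us(rows=8192, row_refresh_ns=50)
print(f"distributed: one row every {interval} us")
print(f"burst: memory blocked for {blocked} us")
```

Under these assumptions, distributed refresh costs one short pause every few microseconds, while burst refresh blocks memory for hundreds of microseconds at a time.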

Typical DRAM Access Sequence (1)-(5)
[Five figure-only slides]

(Very Old) DRAM Read Timing
- Original DRAM specified Row & Column every time

(Old) DRAM Read Timing w/ Fast-Page Mode
- FPM enables multiple reads from the same page without repeating RAS (the row address strobe)

(Newer) SDRAM Read Timing
- SDRAM: Synchronous DRAM
- SDRAM uses a clock, supports bursts
- Double-Data Rate (DDR) SDRAM transfers data on both the rising and falling edges of the clock
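The effect of transferring on both clock edges can be illustrated with a simple peak-bandwidth calculation. The function name and the 800 MHz clock are assumptions chosen for the example; the 64-bit bus width matches the DIMM beat width used later in these slides.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_bits=64, double_data_rate=True):
    """Peak transfer rate in MB/s: clock rate x transfers per cycle x bytes per transfer."""
    transfers_per_cycle = 2 if double_data_rate else 1
    return clock_mhz * transfers_per_cycle * bus_bits // 8


# Hypothetical 800 MHz bus clock.
print(peak_bandwidth_mb_s(800, double_data_rate=False))  # single data rate
print(peak_bandwidth_mb_s(800, double_data_rate=True))   # double data rate
```

This is a peak figure only; it ignores row opens, refresh, and every other source of stalls.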

From DRAM Array to DRAM Chip (1)
- A DRAM chip is one of the ICs you see on a DIMM
  - DIMM = Dual Inline Memory Module
- Typical DIMMs read/write memory in 64-bit (dword) beats
- Each DRAM chip is responsible for a subset of bits in each beat
  - All DRAM chips on a DIMM are identical and work in lockstep
- The data width of a DRAM chip is the number of bits it reads/writes in a beat
  - Common examples: x4 and x8
[Figure: DIMM with a DRAM chip labeled]

From DRAM Array to DRAM Chip (2)
- Each DRAM chip is internally divided into a number of banks
  - Each bank is basically a fat DRAM array, i.e., columns are more than one bit (4-16 are typical)
- Each bank operates independently from other banks in the same device
- Memory controller sends the Bank ID as the higher-order bits of the row address

Banking to Improve BW
- DRAM access takes multiple cycles
- What is the miss penalty for 8 cache blocks?
  - Consider these parameters:
    - 1 cycle to send address
    - 10 cycles to read the row containing the cache block
    - 4 cycles to send out the data (assume DDR)
  - (1 + 10 + 4) × 8 = 120 cycles
- How can we speed this up?
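The arithmetic above can be checked with a short sketch. The function name is hypothetical; the default cycle counts are the slide's parameters, and each block is assumed to be fetched strictly one after another.

```python
def sequential_miss_penalty(blocks, addr_cycles=1, row_cycles=10, data_cycles=4):
    """Cycles to fetch `blocks` cache blocks when every access is fully serialized."""
    per_block = addr_cycles + row_cycles + data_cycles
    return blocks * per_block


print(sequential_miss_penalty(8))  # (1 + 10 + 4) * 8
```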

Simple Interleaved Main Memory
- Divide memory into n banks, “interleave” addresses across them so that cache-block A is
  - in bank “A mod n”
  - at block “A div n”

  Bank 0      Bank 1       Bank 2       ...  Bank n-1
  Block 0     Block 1      Block 2      ...  Block n-1
  Block n     Block n+1    Block n+2    ...  Block 2n-1
  Block 2n    Block 2n+1   Block 2n+2   ...  Block 3n-1

  Physical address: [ Block in bank | Bank ]

- Can access one bank while another one is busy
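The mod/div mapping can be written out directly. This is a minimal sketch; the function name `locate` and the choice of 4 banks in the example are assumptions for illustration.

```python
def locate(block, n_banks):
    """Map cache-block number A to (bank, block-within-bank) = (A mod n, A div n)."""
    return block % n_banks, block // n_banks


# With 4 banks, consecutive blocks land in consecutive banks.
for block in range(8):
    bank, index = locate(block, n_banks=4)
    print(f"block {block} -> bank {bank}, block-in-bank {index}")
```

Because consecutive blocks fall in different banks, a sequential stream keeps all banks busy instead of queuing on one.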

Banking to Improve BW
- In previous example, if we had 8 banks, how long would it take to receive all 8 blocks?
  - (1 + 10 + 4) + 7 × 4 = 43 cycles
- Interleaving increases memory bandwidth w/o a wider bus
- Use parallelism in memory banks to hide memory latency
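A sketch of the banked calculation, under the assumption that there are at least as many banks as blocks so that all row reads overlap and only the data transfers serialize on the shared bus. The function name is hypothetical; the cycle counts are the slide's.

```python
def interleaved_miss_penalty(blocks, addr_cycles=1, row_cycles=10, data_cycles=4):
    """Cycles to fetch `blocks` cache blocks when each block lives in its own bank.

    The first block pays the full latency; each later block only adds its
    data-transfer time, because its row read overlapped with earlier work.
    """
    if blocks == 0:
        return 0
    first = addr_cycles + row_cycles + data_cycles
    return first + (blocks - 1) * data_cycles


print(interleaved_miss_penalty(8))  # (1 + 10 + 4) + 7 * 4
```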

DRAM Organization
[Figure: dual-rank x8 (2Rx8) DIMM, showing the DIMM, its ranks, the x8 DRAM chips in each rank, and a bank inside a chip]
- x8 means each DRAM outputs 8 bits; need 8 chips for DDRx (64-bit)
- Why 9 chips per rank? 64 bits data, 8 bits ECC
- All banks within the rank share all address and control pins
- All banks are independent, but can only talk to one bank at a time
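The chip counts follow from dividing the bus width by the chip data width. A small sketch, with a hypothetical function name; the 64 data bits and 8 ECC bits come from the slide.

```python
def chips_per_rank(chip_width_bits, data_bits=64, ecc_bits=0):
    """Number of identical DRAM chips needed to fill one rank's data (and ECC) bits."""
    total_bits = data_bits + ecc_bits
    if total_bits % chip_width_bits != 0:
        raise ValueError("bus width must be a multiple of the chip width")
    return total_bits // chip_width_bits


print(chips_per_rank(8))               # x8 chips, no ECC
print(chips_per_rank(8, ecc_bits=8))   # x8 chips, with ECC
print(chips_per_rank(4))               # x4 chips, no ECC
```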

SDRAM Topology

CPU-to-Memory Interconnect (1)
- North Bridge can be integrated onto CPU chip to reduce latency
[Figure from ArsTechnica]

CPU-to-Memory Interconnect (2)
- Discrete North and South Bridge chips (Old)
[Figure: CPU, North Bridge, South Bridge]

CPU-to-Memory Interconnect (3)
- Integrated North Bridge (Modern Day)
[Figure: CPU, South Bridge]

Memory Channels
[Figure: three configurations, with commands and data paths from the memory controller(s)]
- One controller, one 64-bit channel
- One controller, two 64-bit channels
- Two controllers, two 64-bit channels
- Use multiple channels for more bandwidth

Memory-Level Parallelism (MLP)
- What if memory latency is 10000 cycles?
  - Runtime dominated by waiting for memory
  - What matters is overlapping memory accesses
- Memory-Level Parallelism (MLP):
  - “Average number of outstanding memory accesses when at least one memory access is outstanding.”
- MLP is a metric
  - Not a fundamental property of workload
  - Dependent on the microarchitecture
- With high-enough MLP, you can hide arbitrarily large memory latencies
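The quoted definition can be turned into a small measurement sketch. It assumes each memory access is given as a hypothetical `(start_cycle, end_cycle)` pair with the end exclusive; the function name and the traces are made up for illustration.

```python
def mlp(accesses):
    """Average number of outstanding accesses over the cycles where at least one is outstanding."""
    events = []
    for start, end in accesses:
        events.append((start, +1))  # access becomes outstanding
        events.append((end, -1))    # access completes
    events.sort()

    busy_cycles = 0      # cycles with >= 1 outstanding access
    weighted_cycles = 0  # sum over busy cycles of the number outstanding
    outstanding = 0
    prev_time = None
    for time, delta in events:
        if prev_time is not None and outstanding > 0:
            span = time - prev_time
            busy_cycles += span
            weighted_cycles += span * outstanding
        outstanding += delta
        prev_time = time
    return weighted_cycles / busy_cycles if busy_cycles else 0.0


print(mlp([(0, 100), (100, 200)]))  # back-to-back misses, no overlap
print(mlp([(0, 100), (0, 100)]))    # two fully overlapped misses
```

The same two misses give different MLP depending on whether the microarchitecture manages to overlap them, which is why MLP is not a property of the workload alone.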

AMAT with MLP
- If cache hit is 10 cycles (core to L1 and back) and memory access is 100 cycles (core to mem and back)
- Then at 50% miss ratio:
  - AMAT = 0.5 × 10 + 0.5 × 100 = 55
- Unless MLP is > 1.0, then
  - at 50% mr, 1.5 MLP: AMAT ≈ (0.5 × 10 + 0.5 × 100) / 1.5 ≈ 37
  - at 50% mr, 4.0 MLP: AMAT ≈ (0.5 × 10 + 0.5 × 100) / 4.0 ≈ 14
- In many cases, MLP dictates performance
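The slide's numbers can be reproduced with a few lines. This follows the slide's simplified model, in which the whole average access time is divided by MLP; the function name is hypothetical.

```python
def amat(hit_cycles, miss_cycles, miss_ratio, mlp=1.0):
    """Average memory access time under the slide's model, scaled down by MLP."""
    base = (1 - miss_ratio) * hit_cycles + miss_ratio * miss_cycles
    return base / mlp


for level in (1.0, 1.5, 4.0):
    print(f"MLP {level}: AMAT = {amat(10, 100, 0.5, mlp=level):.2f} cycles")
```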

Memory Controller
[Figure: requests from the CPU enter the memory controller, which contains a scheduler and a buffer and drives Channel 0 and Channel 1]

Memory Controller (2)
- Memory controller connects CPU and DRAM
- Receives requests after cache misses in LLC
  - Possibly originating from multiple cores
- Complicated piece of hardware, handles:
  - DRAM refresh
  - Row-buffer management policies
  - Address mapping schemes
  - Request scheduling

Request Scheduling in MC (1)
- Write buffering
  - Writes can wait until reads are done
- Controller queues DRAM commands
  - Usually into per-bank queues
  - Allows easily reordering ops. meant for same bank
- Common policies:
  - First-Come-First-Served (FCFS)
  - First-Ready, First-Come-First-Served (FR-FCFS)

Request Scheduling in MC (2)
- First-Come-First-Served
  - Oldest request first
- First-Ready, First-Come-First-Served
  - Prioritize column changes over row changes
  - Skip over older conflicting requests
  - Find row hits (on queued requests)
    - Find oldest
    - If no conflicts with in-progress request → good
    - Otherwise (if conflicts), try next oldest
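The core idea of FR-FCFS (serve the oldest row hit first, otherwise the oldest request) can be sketched as below. This is a heavily simplified illustration: requests are hypothetical `(bank, row)` pairs, all requests are assumed to be queued up front, and DRAM timing constraints, in-progress conflicts, and starvation are ignored.

```python
def fr_fcfs_pick(queue, open_rows):
    """Remove and return the oldest request that hits an open row, else the oldest request."""
    for i, (bank, row) in enumerate(queue):
        if open_rows.get(bank) == row:
            return queue.pop(i)
    return queue.pop(0)


def schedule(requests):
    """Return the service order and the number of row opens under this simplified FR-FCFS."""
    queue = list(requests)  # arrival order, oldest first
    open_rows = {}          # bank -> currently open row
    order, row_opens = [], 0
    while queue:
        bank, row = fr_fcfs_pick(queue, open_rows)
        if open_rows.get(bank) != row:
            row_opens += 1
            open_rows[bank] = row
        order.append((bank, row))
    return order, row_opens


requests = [(0, 1), (0, 2), (0, 1), (0, 2)]  # alternating rows in one bank
order, row_opens = schedule(requests)
print(order, row_opens)
```

Serving this queue strictly in arrival order would open a row four times; reordering for row hits needs only two opens.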

Request Scheduling in MC (3)
- Why is it hard?
- Tons of timing constraints in DRAM
  - tWTR: Min. cycles before a read after a write
  - tRC: Min. cycles between consecutive row opens in the same bank
- Simultaneously track resources to prevent conflicts
  - Channels, banks, ranks, data bus, address bus, row buffers
- Do it for many queued requests at the same time, while not forgetting to do refresh

Row-Buffer Management Policies
- Open-page policy
  - After access, keep page in DRAM row buffer
  - Next access to same page → lower latency
  - If access to different page, must close old one first
  - Good if lots of spatial locality
- Close-page policy
  - After access, immediately close page in DRAM row buffer
  - Next access to different page → lower latency
  - If access to different page, old one already closed
  - Good if no locality (random access)
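The two policies can be compared on a single bank with a toy latency model. The cycle counts `T_OPEN`, `T_COLUMN`, and `T_CLOSE` are made-up values, and the close-page model assumes the close after each access is fully hidden before the next request arrives.

```python
# Hypothetical costs in cycles (not taken from the slides or any datasheet).
T_OPEN, T_COLUMN, T_CLOSE = 10, 4, 10


def open_page_latency(trace):
    """Total latency for a sequence of row numbers when the last row is kept open."""
    open_row = None
    total = 0
    for row in trace:
        if open_row == row:
            total += T_COLUMN                     # row-buffer hit
        elif open_row is None:
            total += T_OPEN + T_COLUMN            # nothing open yet
        else:
            total += T_CLOSE + T_OPEN + T_COLUMN  # must close the old row first
        open_row = row
    return total


def close_page_latency(trace):
    """Total latency when the row is closed right after every access."""
    return len(trace) * (T_OPEN + T_COLUMN)


local_trace = [5, 5, 5, 5]   # strong locality
random_trace = [1, 2, 3, 4]  # no locality
print(open_page_latency(local_trace), close_page_latency(local_trace))
print(open_page_latency(random_trace), close_page_latency(random_trace))
```

With locality, open-page wins because most accesses are row-buffer hits; without it, close-page wins because the close is already off the critical path.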

Address Mapping Schemes (1)
- Question: How to map a physical addr to channel ID, rank ID, bank ID, row ID, column ID?
  - Goal: efficiently exploit channel/rank/bank level parallelism
- Multiple independent channels → max parallelism
  - Map consecutive cache lines to different channels
- Single channel, multiple ranks/banks → OK parallelism
  - Limited by shared address and/or data pins
  - Map consecutive cache lines to banks within same rank
    - Reads from same rank are faster than from different ranks
- Accessing different rows from one bank is slowest
  - All requests serialized, regardless of row-buffer mgmt. policies
  - Rows mapped to same bank should avoid spatial locality
- Column mapping depends on row-buffer mgmt (why?)

Address Mapping Schemes (2)
[Figure: example addresses (e.g., 0x00E00 and 0x00F00) mapped under two field orders, [... bank column ...] and [... column bank ...]]

Address Mapping Schemes (3)
- Example open-page mapping scheme:
  - High parallelism: [row rank bank column channel offset]
  - Easy expandability: [channel rank row bank column offset]
- Example close-page mapping scheme:
  - High parallelism: [row column rank bank channel offset]
  - Easy expandability: [channel rank row column bank offset]
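A mapping scheme is just a fixed order of bit fields, listed from most to least significant. The sketch below decodes an address under the two "high parallelism" layouts; the field widths (for example a 6-bit offset for 64-byte lines and a 1-bit channel for two channels) are assumptions picked for the example, not values from the slides.

```python
def decode(addr, layout):
    """Split `addr` into named fields; `layout` lists (name, bits) from MSB to LSB."""
    fields = {}
    for name, bits in reversed(layout):  # peel fields off the low end first
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields


# Hypothetical field widths.
OPEN_PAGE = [("row", 14), ("rank", 1), ("bank", 3),
             ("column", 7), ("channel", 1), ("offset", 6)]
CLOSE_PAGE = [("row", 14), ("column", 7), ("rank", 1),
              ("bank", 3), ("channel", 1), ("offset", 6)]

for addr in (0, 64, 128, 192):  # four consecutive 64-byte cache lines
    o, c = decode(addr, OPEN_PAGE), decode(addr, CLOSE_PAGE)
    print(addr, "open-page:", (o["channel"], o["bank"], o["column"]),
          "close-page:", (c["channel"], c["bank"], c["column"]))
```

Both layouts alternate consecutive lines across channels. Within a channel, the open-page layout walks the columns of one row (row-buffer hits), while the close-page layout moves to a different bank.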

Overcoming Memory Latency
- Caching
  - Reduce average latency by avoiding DRAM altogether
  - Limitations
    - Capacity (programs keep increasing in size)
    - Compulsory misses
- Prefetching
  - Guess what will be accessed next
  - Bring it to the cache ahead of time
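As one concrete way of "guessing what will be accessed next", the sketch below uses a next-line prefetcher: on every access to block b it also brings in block b + 1. This particular policy is an assumption chosen for illustration (the slide does not name one), and the model ignores cache capacity and whether the prefetch arrives in time.

```python
def count_misses(trace, prefetch_next=False):
    """Count cache misses for a sequence of block numbers, with optional next-line prefetch."""
    cache = set()  # unbounded set of cached block numbers (capacity ignored)
    misses = 0
    for block in trace:
        if block not in cache:
            misses += 1
            cache.add(block)
        if prefetch_next:
            cache.add(block + 1)  # guess: the next block will be needed soon
    return misses


sequential = list(range(8))  # hypothetical streaming access pattern
print(count_misses(sequential))                      # every access is a compulsory miss
print(count_misses(sequential, prefetch_next=True))  # only the first access misses
```

On a sequential stream the prefetcher removes the compulsory misses that caching alone cannot avoid; on an irregular pattern its guesses would mostly be wasted.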
