32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf

上传人：*** IP属地：浙江上传时间：2014-01-05 格式：PDF 页数：8 大小：950.29KB 积分：6 举报 版权申诉

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf_第2页

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf_第3页

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf_第4页

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf_第5页

已阅读5页，还剩3页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

USINGCACHEMEMORYTOREDUCEPROCESSOR-MEMORYTRAFFICJamesR.GoodmanDepartmentofComputerSciencesUniversityofWisconsin-MadisonMadison,WI58706ABSTRACT-Theimportanceofreducingprocessor-memorybandwidthisrecognizedintwodistinctsitua-tions:singleboardcomputersystemsandmicroproces-sorsofthefuture.Cachememoryisinvestigatedasawaytoreducethememory-processortraffic.Weshowthattraditionalcacheswhichdependheavilyonspatiallocality(look-ahead)fortheirperformanceareinap-propriateintheseenvironmentsbecausetheygeneratelargeburstsofbustraffic.Acacheexploitingprimarilytemporallocality(look-behind)isthenproposedanddemonstratedtobeeffectiveinanenvironmentwhereprocessswitchesareinfrequent.Wearguethatsuchanenvironmentispossibleifthetraffictobackingstoreissmallenoughthatmanyprocessorscanshareacommonmemoryandifthecachedataconsistencyproblemissolved.Wedemonstratethatsuchacachecanindeedreducetraffictomemorygreatly,andintroducee.relegantsolutiontothecachecoherencyproblem.1.IntroductionBecausetherearestraightforwardwaystocon-structpowerful,cost-effectivesystemsusingrandomaccessmemoriesandsingle-chipmicroprocessors,sem-iconductortechnologyhas,untilnow,hadthegreatestimpactthroughthesecomponents.High-performanceprocessors,however,arestillbeyondthecapabilityofasingle-chipimplementationandarenoteasilyparti-tionedinawaywhichcaneffectivelyexploitthetechnol-ogyandeconomiesofVLSI.Aninterestingphenomenonhasoccurredinthepreviousdecadeasaresultofthisdisparity.Memorycostshavedroppedradicallyandcon-sistentlyforcomputersystemsofallsizes.WhilethecomponentcostofaCPU(single-chipimplementationsexcluded)hasdeclinedsignificantlyoverthesameperiod,thereductionhasbeenlessdramatic.Aresultisthattheamountofmemorythoughttoheappropriateforagivenspeedprocessorhasgrowndramaticallyinrecentyears.Todaysmallminicomputershavememoryaslargeasthatofthemostexpensivemachinesofadecadeago.Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.TheimpactofVLSIhasbeenverydifferentinmicroprocessorapplications.Herememoryisstillregardedasanexpensivecomponentinthesystem,andthosefamiliarprimarilywithaminicomputerormain-frameenvironmentareoftenscornfulofthetroubletowhichmicroprocessorusersgotoconservememory.Thereason,ofcourse,isthateventhesmallmemoryinamicroprocessorisamuchlargerportionofthetotalsys-temcostthanthemuchlargermemoryonatypicalmainframesystem.Thisresultsfromthefactthatmemoryandprocessorsareimplementedinthesametechnology.1.1.ASuperCPUWiththeadvancestoVLSIoccurringnowandcon-tinuingoverthenextfewyears,itwillbecomepossibletofabricatecircuitsthatareonetotwoordersofmagni-tudemorecomplexthancurrentlyavailablemicropro-cessors.Itwillsoonbepossibletofabricateanextremelyhigh-performanceCPUonasinglechip,IftheentirechipisdevotedtotheCPU,however,itisnotagoodidea.Extrapolatinghistoricaltrendstopredictfuturecomponentdensities,wemightexpectthatwithinafewyearsweshouldbeabletopurchaseasingle-chipprocessorcontainingatleasttentimesasmanytransis-torsasoccurin,say,theMC68000.FortheempiricalruleknownasGroschslawGrosch53,P=kCg,wherePissomemeasureofperformance,Cisthecost,andkandgareconstants,KnightKnight66concludedthatgisatleast2,andSolomonSolomon66hassuggestedthatg1.47.FortheIBMSystem/B70family,Siewiorekdeterminedthatgal.8Siewiorek82.WhileGroschslawbreaksdowninthecomparisonofprocessorsusingdifferenttechnologyorarchitectures,itisrealisticforpredictingimprovementswithinasingletechnology.Siewiorekinfactsuggeststhatitholdsbydefinition.Assumingg=1.5andusingprocessor-memorybandwidthasourmeasureofperformance,Groschslawpredictsthataprocessorcontainingi0timesasmanytransistorsasacurrentmicroprocessorwouldrequire30timesthememorybandwidth.*TheMotorolaMC68000,runningat10MHz,accessesdatafrommemoryatamaximumrateof5millionbytespersecond,usingmorethanhalfitspinstoachievethisrate.Althoughpackag-ingtechnologyisrapidlyincreasingthepinsavailabletoachip,itisunlikelythattheincreasewillhe30-fold(the68000has84pins).Wewouldsuggestafactoroftwoisrealistic.Althoughsometechniquesareclearlypossibletoincreasethetransferrateintoandoutofthe68000,supplyingsuchaprocessorwithdataasfastasneededisasevereconstraint.Oneofthedesignersofthe88000,hasstatedthatallmodernmicroprocessors-the680001Thisisaconservativeestimate,infact,becauseitignorespredictabledecreasesinatedelays.1983ACM0149-7111/83/0600/0124501.00124included-arealreadybus-limitedTredennickB2.1.2.On.-chipMemoryOnealternativeforincreasedperformancewithoutproportionatelyincreasingprocessor-memorybandwidthistointroducememoryonthesamechipwiththeCPU.Withtheabilitytofabricatechipscontainingonetotwomilliontransistors,itshouldbepossible-usingonlyaportionofthechip-tobuildaprocessorsignificantlymorepowerfulthananycurrentlyavailablesingle-chipCPU.WhiledevotingtheentirechiptotheCPUcouldresultinastillmorepowerfulprocessor,introducingon-chipmemoryoffersareductioninmemoryaccesstimeduetotheinherentlysmallerdelaysascomparedtointer-chipdatatransfers.Ifmostaccesseswereon-chip,itmightactuallyperformasfastasthemorepowerfulprocessor.Ideally,thechipshouldcontainasmuchmemoryastheprocessorneedsformainstorage.Conventionalwisdomtodaysaysthataprocessorofthespeedofcurrentmicroprocessorsneedsatleast1/megabytesofmemoryLindsayB1.Thisiscertainlymorethanisfeasibleon-chip,thoughahighperformanceprocessorcouldprobablyusesubstantiallymorethanthat.ClearlyalltheprimarymemoryfortheprocessorcannotbeplacedonthesamechipwithapowerfulCPU.Whatisneededisthetopelementofamemoryhierarchy,1.3.CacheMemoryTheuseofcachememory,however,hasoftenaggra-vatedthebandwidthproblemratherthanreduceit.SmithSmithB2saysthatoptimizingthedesignhasfourgeneralaspects:(1)maximizingthehitratio,(2)minimizingtheaccesstimetodatainthecache,(3)minimizingthedelayduetoamiss,and(4)minimizingtheoverheadsofupdatingmainmemory,maintainingmulticacheconsistency,etc.Theresultisoftenalargerburstbandwidthrequirementfrommainstoragetothecachethanwouldbenecessarywithoutacache.Forexample,thecacheontheIBMSys-tem/370model16B,iscapableofreceivingdatafrommainmemoryatarateof100megabytespersecondIBM76,ItsuppliesdatatotheCPUatlessthan1/3thatrate.Thereasonisthattoexploitthespatiallocal-ityinmemoryreferences,thedatatransferredfrombackingstoreintothecacheisfetchedinlargeblocks,resultinginrequirementsofveryhighbandwidthburstsofdata.WehavemeasuredtheaveragebandwidthonanIBMSystem/370model155,andconcludedthattheflyer-agebacking-store-to-cachetrafficislessthanthecaehe-to-CPUtraffic.Thedesignofcachememoryformini-computersdemandedgreaterconcernforbusbandwidth.ThedesignersofthePDP-11models80and70clearlyrecog-nizedthatsmallblocksizeswerenecessarytokeepmainmemorytraffictoaminimumBell78.Loweringthebandwidthfrombackingstoretothecachecanbeaccomplishedinoneoftwoways:(1)smallblocksofdataarebroughtfrombackingstoretothecache,or(2)longdelaysoccurwhileablockisbeingbroughtin,independentof(andinadditionto)theac-cesstimeofthebackingstore.Whileitispossibletobringinthewordrequestedini-tially(readthrough),thusreducingthewaitonagivenreference,thelowbandwidthmemoryinterfacewillremainbusylongaftertheinitialtransferiscompleted,resultinginlongdelaysifasecondbackingstorageoperationisrequired.Wethereforehaveexploredtheeffectivenessofacachewhichexploitsprimarilyorexclusivelytemporallocality,i.e.,theblocksfetchedfrombackingstoreareonlythesizeneededbytheCPU(orpossiblyslightlylarger).Inconsideringwaystoevaluatethisstrategy,weidentifiedacommercialenvironmentthatcontainedmanyofthesameconstraintsandseemedamenabletothesamekindsofsolutions.Thisenvironmentisthemarketplaceofthesingle-boardcomputerrunningonastandardbussuchasMultibusorVersabus.sWehavechosentostudythisenvironmentinanattempttogaininsightintotheoriginal,generalscheme.2.TheSingleBoardComputerApplicationAsingleboardcomputertypicallycontainsamicroprocessorandasubstantialamountofmemory,thoughsmallenoughthatitmustbeusedcarefully.Ifneeded,accesstoadditionalrandomaccessmemoryisthroughthebus,whichisdesignedforgeneralityandsimplicity,notforhighperformance.Multibus,inpartic-ular,wasdefinedintheearly70stoofferaninexpensivemeansofcommunicationamongavarietyofsub-systems.AlthoughoriginallyintroducedbyIntelCor-poration,ithasfoundwideacceptance,havingbeenpro-posed-inaslightlymodifiedform-astheIEEEP796busstandard/EEEB0.Currently,severalhundredven-dorsofferMultibus-compatiblecards.Whilethemarkethasrapidlydevelopedforproductsusingthisbus,itsapplicationsarelimitedbythesevereconstraintimposedbythebandwidthofMultibus.Clearlythebusbandwidthcouldbeincreasedbyincreas-ingthenumberofpins,andbymodifyingtheprotocol.Itsbroadpopularityandtheavailabilityofcomponentstoimplementitsprotocolmean,however,thatitislikelytosurvivemanyyearsinitspresentform.Thusalargemarketexistsforacomputer-on-a-cardwhich,muchasifitwereallonasinglechip.hasseverelimitationsonitscommunicationswiththerestofthesystem.WedecidedtodetermineifacachememorysystemcouldbeimplementedeffectivelyintheMultibusenvironment.Tothatendwehavedesignedacachetobeusedwithacurrent-generationmicroprocessor.Inaddition,wehavedoneextensivesimulationofcacheperformance,drivenbymemorytracedata.WehaveidentifiedanewcomponentwhichisparticularlysuitedforVLSIimplementationandhavedemonstrateditsfeasibilitybydesigningitRavishankarB3.Thiscom-ponent,whichimplementsthetagmemoryforadynamicRAMcacheintendedforamicroprocessor,issimilarinmanyrespectstotherecentlyannouncedTMS2150Tree.Multibussystemshavegenerallydealtwiththeprob-lemoflimitedbusbandwidthbyremovingmostoftheprocessor-memoryaccessesfromthebus.Eachproces-sorcardhasitsownlocalmemory,whichmaybeaddressabletoothersthroughtheMultibus.Whilethisapproachhasmuchincommonwithours,webelievethattheallocationofmemory-localorremote-shouldbehandledbythesystem,freeingtheprogrammerofthistask.IntypicalMultibusapplications,considerableeffortisexpendedguaranteeingthattheprogramrun-ningisprimarilyresidenton-board.Thisapproachisviableforastaticpartitioningoftasks.Resultstodatehavebeenmuchlesssatisfactory,however,forthemoregeneralsituationwhereanumberofprocessorsaredynamicallyallocated.(Forefficiencyreasonsitalsoprecludestheuseofsharedcodesegments).eMultibusisatrademarkofIntelCorporation.SVersabusisatrademarkofMotorola.125Inmanyenvironments,asimplp,dynamichardwareallocationschemecanefficientlydeterminewhatmemorylocationsarebeingaccessedfrequentlyandshouldthereforebekeptinlocalmemory-betterthantheprogrammerwhooftenhaslittleinsightintothedynamiccharacteristicsofhisprogram.Thereareenvironmentswheretheprogrammerisintimatelyfami-liarwiththebehaviorofhisprogramandcangeneratecodetotakeadvantageofit.Inthisenvironmentthetimespentrunningaprogramisoftenmuchmoresub-stantialthanthetimedevelopingtheprogram.Thisexplains,forexample,whyaninvisiblecacheisnotappropriateontheCRAY-1.Webelievethatfreeingtheprogrammerfromconcernaboutmemoryallocationisessentialwhereprogrammerproductivityiscritical.B.i.ASingle-BoardComputerwithCacheToevaluateourapproach,weproposedasingle-boardcomputercontaining,(possiblyalongwithotherthings)aCPUandnolocalmemoryexceptacache,withbackingstoreprovidedthroughMultibus.Thuswepickedanimportantprobleminitsownright:CanwebuildacachethatworkswithaMultibussystemsupport-ingmultipleprocessors?Inparticular,howmanypro-cessorscanwesupportrunninginparallelonMultibus?Webelievethatasystemwhichcouldreasonablysupportfiveto10processorswouldbeasignificantadvance.ThiscantbecompareddirectlyagainstcurrentsystemsbecauseasingleprocessoroverloadstheMultibus.Thuslocalmemoriesmustbeheavilyexploitedifperformanceisimportant.EarlieranalysesKaplan73,Bell74,Rao78,Pate182haveusedthecachehitratioorsomethingcloselyrelatedtomeasureperformance.Theimportantcri-terionhereistomaximizeuseofthebus,notthehitratio,orevennecessarilytooptimizeprocessorperfor-mance.Weoptimizesystemperformancebyoptimizingbusutilization,achievinghigherperformancebyminim-izingindividualprocessorsbusrequirements,andtherebysupportingmoreprocessorsreasonablywell.Weallowindividualprocessorstositidleperiodicallyratherthantieupthebusfetchingdatawhichtheymightnotuse.Thisimpliesthatthecachestaledataproblemmustbesolvedeffectively.Wepresentanewsolutioninsection3.2.2.SwitchingContextsWherebusbandwidthislimited,ataskswitchisamajordisturbance,sincethecachemusteffectivelybereloadedatthistime.Theprocessorismomentarilyreducedtoaccessesattherateatwhichthebuscansupplythem.Whilethisproblemseemsunavoidable,itneednotbeseriousiftaskswitchingisminimized.Weareprovidinganenvironmentwhichallowsmanyproces-sorstoworkoutofasinglemonolithicmemoryinparal-lel.Ifmoreparalleltasksarerequired,moreprocessorscanbeused.WepointoutthatthecurrentMultibusalternativeistomovetheprogramintolocalmemory,anoperationwhichalsoswampsthebus.Thetaskswitchmerelymakesthisoperationimplicit,andavoidsbring-ingacrossthebusdatawhichareneveractuallyused.Writingtheolddataoutisalsonoworsethanthealterna-tive,sinceweonlywritethatwhichhasbeenchangedandwhichhasnotbeenalreadypurged.Theremaybecertaincases-aninterrupthandlingprogram,forexample-whereaparticularprogramdoesnotflushthecache,butusesonlyasmallportionofit.Provisionscouldbemadetoallowsuchaprogramtobelockedinthecache.Alternatively,aseparatecachemightbeprovidedforsuchaprogramOurstudiesindi-catethatarelativelysmallcachecanbeeffectiveforasingleprogram,soitmaybepossibletokeepseparatecachesaroundforindividualprocessesifthenumberissmall.Wewouldsu&g；:,；,tmgthisonestepfurtherandprovidinganadditionalprocessorforeachcache.Aninterestingquestionthenarisesastothecostofdynami-callyassigningprocessestoprocessors.Ourproposalallowsthisassignment,thoughclearlyatsomeperfor-mancepenalty.3.CacheCoherencyItiswell-knownthatmultiplecachespresentseriousproblemsbecauseoftheredundancyofstorageofasin-glelogicalmemorylocationTang76,Censier78,Rao78.Themostcommonmethodamongcommercialproductsfordealingwiththis,thestaledataproblem,istocreateaspecial,high-speedbusonwhlehaddressesaresentwheneverawriteoperationisperformedbyanyproces-sor.ThissolutionhasweaknessesCeLisier78whichhavegenerallylimitedcommercialimplementationstotwoprocessors.Inthesingle-chipprocessororsingle-boardcomputerenvironments,ithastheaddedweak-nessthatitrequiresanumberofextraI/0pins.Analternativeapproach,implementedinC,mmpHoogendoorn77andproposedbyNortonNorton82.istorequiretheoperatingsystemtorecognizewhenincon-sistenciesmightoccurandtakestepstopreventthemunderthosecircumstances.Thissolutionisunappealingbecausethecacheisnormallyregardedasanarchitecture-independentfeature,invisibletothesoftware.Athirdapproach,variationsofwhichhavebeenpro-posedbyCensierandFeautrierCensier78,TangTang76,WiddoesWiddoes79,andYenandFuYen82,istousesomeformoftaggedmainmemory,keepingtrackofindividualblocksinthiswaytopreventincon-sistency.Individualblocksaretemporarilydesignatedasprivateforaparticularprocessorsothatitmaymodifyitrepeatedlywithoutreferencetomainmemory.Thetagmustbesetwheneversuchacriticalsectionisenteredandresetwheneverthecriticalsectionisleft,i.e.,themodifiedwordiswrittenbacktomainstorage.Thisapproachrequiressubstantialhardware,andappearsinfeasibleforalargenumberofcaches,sinceanoperationinacentralplaceisrequiredattheentryorexitofanycriticalsection.Ourapproachhasmuchincommonwiththethirdapproach,butallowsthecriticalsectioninformationtobedistributedamongthecaches,whereitalreadyresides.Inaddition,weusethenormalreadandwriteoperations,withnotagbitsinmainmemory,toaccom-plishthesynchronization.ArelatedschemeArndah182whichusesaspecialbustoconveythenoticeofentryorexitfromacriticalsection,hasbeenimplementedinacommercialproduct,buthasnotbeenpublishedtoourknowledge.Wecallourschemewr/te-once.3.1.Write-ThroughorWrite-Back?WhilethechoicebetweenwriLe-through(alsoknownasstore-through)andwrite-back(alsoknownasstore-backorcopy-back)hasnobearingonthereadhitratio,ithasamajorimpactonbustraffic,particularlyasthehitratioapproaches100%.Inthelimit,whenthehitratioisi00%.write-backresultsinnobustrafficatall.whilewrite-throughrequiresatleastonebuscycleforeachwriteoperation.NortonNorton82concludedthatusingwrite-backinsteadofwrite-throughforahypothet-icalprocessortypicallywouldreducethebustrafficbymorethan50%andiftheprocessesrantocompletionbustrafficwouldbedecreasedbyafactorof8.Fortypi-calread-to-writeandhitratiosandwhentaskswitchingisinfrequent,oursimulationshavegivenstrongevidencethatwrite-backgeneratessubstantiallylessbustrafficthanwrite-through.126Butwrite-backhasmoreseverecoherencyprob-lemsthanwrite-through,sinceevenmainmemorydoesnotalwayscontainthecurrentversionofaparticularmemorylocation.3.2.ANewWriteStrategy:Write-OnceWeproposeanewwritestrategywhichsolvesthestaledataproblemandproducesminimalbustraffic.Thereplacementtechniquerequiresthefollowingstruc-ture.Associatedwitheachblockinthecachearetwobitsdefiningoneoffourstatesfortheassociateddata:InvalidThereisnodataintheblock.VainThereisdataintheblockwhichhasbeenreadfrombackingstoreandhasnotbeenmodified.ReservedThedataintheblockhasbeenlocallymodifiedexactlyoncesinceitwasbroughtintothecacheandthechangehasbeentransmittedtobackingstore.D/tryThedataintheblockhasbeenlocallymodifiedmorethanoncesinceitwasbroughtintothecacheandthelatestchangehasnotbeentransmittedtobackingstore.Write-oncerequiresrapidaccesstotheaddresstagsandstatebitpairsconcurrentlywithaccessestotheaddresstagsbytheCPU.Thiscanmosteasilybeachievedbycreatingtwo(identical)copiesofthetagmemory.CensierCensier78claimsthatduplicationistheusualwayoutforresolvingcollisionsbetweencacheinvalidationrequestsandnormalcacherefer-ences.Thisisnotalargecost,sinceasinglechipdesignofthispartofthecache-usingpresenttechnology-isquitefeasible.Further,wehavediscoveredawaytoreducesubstantiallythenumberoftagsrequired.Inaddition,thesamechiptypecouldbeusedforbothinstances.ThisisanaturalwaytopartitionthecacheinVLSIbecauseitresultsinamaximallogic-to-pinratio.WehavedesignedandsubmittedforfabricationsuchachipRavisha_nkar83.Thetwocopiesalwayscontainexactlythesameaddressdata,becausetheyarealwayswrittensimul-taneously.WhileoneunitisusedintheconventionalwaytosupportaccessesbytheCPU,asecondmonitorsallaccessestomemoryviatheMultibus.Foreachsuchoperation,itchecksfortheaddressinthelocalcache.Ifamatchisfoundonawriteoperation,itnottiesthecachecontroller,andtheappropriateblockinthecacheismarkedinvalid.Ifamatchisfoundonareadopera-tion,nothingisdoneunlesstheblockhasbeenmodified,i.e.,itsstateisreservedordirty.Ifitisjustreserved,thestateischangedtovalid.Ifitisdirty,thelocalsys-temsinhibitsthebackingstorefromsupplyingthedata.Itthensuppliesthedataitself.4Onthesamebusaccessorimmediatelyfollowingit,thedatamustbewrittentobackingstore.Inaddition,foreitherreservedordirtydata,thestateischangedtovalid.Thisschemeachievescoherencyinthefollowingway.Initiallywrite-throughisemployed.However,anadditionalgoalisachieveduponwriting.Allothercachesarepurgedoftheblockbeingwritten,sothecachewrit-ingthroughthebusnowisguaranteedtheonlycopyexceptforbackingstore.Itissoidentifiedbybeingmarkedreserved.Ifitispurgedatthispoint,nowriteisnecessarytobackingstore,sothisisessentiallywrite-through.fanotherwriteoccurs,theblockismarkeddirty.Nowwrite-backisemployedand,onpurging,thedatamustberewrittentobackingstore.4ThereisamechanisminMultibuswhichallowsthiscapability.Unfortunately,itisrarelyused,notwell-defined,andrequiresthatocalcachesrespondveryrapidly.Versabushasamuchcleanermechanismbywhichthisendcanbeaccomplished.Write-oncehasthedesirablefeaturethatunitsaccessingbackingstoreneednothaveacache,andneednotknowwhetherothersdoornot.Acacheisresponsi-bleformaintainingconsistencyexactlyforthosecaseswhereitmightcreateaviolation,i.e.,wheneveritwritestoalocation.Thusitispossibletomixinanarbitrarywaysystemswhichemployacacheandthosewhichdonot；thelatterwouldprobablybeI/0devices.Consider-ablecaremustbeexercised,however,whenawriteoperationoverthebusmodifieslessthananentireblock.4.SimulationWedesignedacachememorysystemtoworkonMultibus.Tovalidateourdesignbeforebuildingitwedidextensivesimulationusingmemorytracedata.Todatewehaveperformedextensivesimulationsforsixtraces,allrunningunderUNIX:EDCTheUNIXeditoredrunningascript.ROFFASTheoldUNIXtextprocessorprogramroff.TRACETheprogram,writteninassemblylanguage,whichgeneratedtheabovetracesforthePDP-11.NROFFTheprogramtroFinterpretingtheBerkeleymacropackage-me.CACHEThetrace-drivencachesimulatorpro-gram.COMPACTAprogramusinganon-linealgorithmwhichcompressesfilesusinganadaptiveHuffmancode.ThefirstthreetracesareforaPDP-11,whilethelatterthreeareforaVAX.WhilethePDP-11doesnotrunonMultibus,itsinstructionsetissimilartomanymicropro-cessorswhichdo,andtheprogramsusedfortracingwereofthekindweenvisionforsuchasystem.ThePDP-11issimilarinmanywaystotheMC58000,andhasincommonwiththe8086alimitedaddressingcapability.WhiletheVAXalsodoesnotrunonMultibus,itisanexampleofamoderninstructionsetand,thereforeisareasonableexampleofthekindofprocessorlikelytoappearinasingle-chipCPUinthefuture.Italsohasalargeraddressspacewhich,asshowninsection4.3,issignificant.Weareactuallyusingvirtualaddresses,butalloftheprogramsweranaresmallenoughtofitintomainmemory.Sincewearetracingonlyasinglepro-cess,weconcludethatthereisnosignificantdifferencebetweenvirtualandrealaddresses.Inadditiontocacheparameters,missratiosvarygreatlydependingontheprogramrunning.Fortheeachoftheabovetraces,awideandunpredictablevariationoccurredaswevariedasingleparameter.Thusplottingparametersfortheindividualtraceswasoftennotenlightening.Averagingoverthethreetracesineachcategorygavemuchmorerevealingresults,providingdatathatsuggestedacontinuousfunctionformanyofthevariablesstudied.Thusallourresultsareactuallytheaverageofthreeprograms,eachrunningalone.4.1.EffectofWriteStrategyonBusTrallieAitheughwrite-throughnormallygenerateslessbustrafficthanwrite-back,thelattercanbeworseifthehitratioislowandtheblocksizeislarge.Underwrite-back,whenadirtyblockispurged,theentireblockmustbewrittenout.Withwrite-through,onlythatpor-tionwhichwasmodifiedmustbewritten.Wefoundthatwrite-backisdecisivelysuperiortowrite-throughexcept(1)whencacheblocksareverylarge,or(2)whenthecachesizeisverysmall.SUNIXandNROFFaretrademarksofBellLaboratories.127Write-onceresultsinbustrafficroughlyequaltothebetterofthetwo.Wehavefoundcacheparametersforwhichitactuallyperformsbetterontheaveragethaneitherwrite-throughorwrite-backforanumberofpro-grams.Thiswasasurprisingresult,sincewrite-oncewasdevelopedtoassurecoherency,nottominimizebustraffic.Thereplacementschemeoutperformsbothwrite-throughandwrite-backwheneverthetotalnumberofsetsisabout16.Forexample,fora4-waysetassocia-tive,2048-bytecachewithablocksizeof32bytes,theaveragebustrafficforthreePDP-11programsforwhichwehavetraceswas30.768Zforwrite-through,17.55Zforwrite-back,and17.387oforwrite-once,4.2.ColdStartvs.WarmStartAnimportantconsiderationindeterminingcachehitratioandbustrafficisthecoldstartperiodknownasthelifetimefunctionEaston78,duringwhichtimemanymissesoccurbecausethecacheisempty.Thisisdefinedastheperioduntilasmanycachemisseshaveoccurredasthetotalnumberofblocksinthecache.Thisinitialburstofmissesisamortizedoverallaccesses,sothelongerthetraceanalyzed,thelowerthemissratioobtained.Inadditiontotheinitiationofaprogramandoccasionalswitchesofenvironments,acoldstartgen-erallyoccurswheneverthereisataskswitch.Thusanimportantassumptionintraditionalcacheevaluationisthefrequencyoftaskswitching.Wehavearguedthattaskswitchingmustbeveryinfrequentinoursystem.Thuswecanmorenearlyapproachinpracticethewarmstarthitratios,andthusitisappropriatetouseverylongtracesofasingleprogram,andassumeawarmstart.Wedidthatinitially,usingthefulllengthofthePDP-11tracesavailabletous(1,256,570memoryaccesses).Wenoted,however,thatforevenmuchshortertracesthanwewererunning,therewaslittledifferencebetweenwarmstartandcoldstartstatistics.Sincecoldstartstatisticsareeasiertogenerate,wenor-mallyusedthem.Unlessstatedotherwise,ourresultsarefromcoldstart,butatleast10timesthelifetimefunctionintotallength.4.3.CacheSizeIngeneral,weweresurprisedattheeffectivenessofasmallcache.ForthePDP-11traceswithacacheof2Kbytesorlarger,wediscoveredthatessentiallynomissesoccurredafterthecoldstartperiod.Thesearenottrivialprograms,butwererunonamachinewhichhasonly64Kbytesforbothinstructionsanddata.Thepro-gramsareveryfrugalintheiruseofmemory,andtheentireworkingsetapparentlycanfitinthecache.TheVAXtracesdonotexhibitthesamelocalityobservedwiththePDP-11,anda64K-bytecachewasnotlargeenoughtocontaintheentireworkingsetoftheprogram.Thismaybearesultofthelargeraddressspaceavailable,themorecomplexinstructionset,ormorecomplexprograms.InallcasestheprogramswerespreadoutoveramuchlargermemoryspacethanforthePDP-11traces.Forthiscomparisonweusedasmallblocksizeof4bytes.ThismayhavehadagreaterimpactontheVAXthanonPDP-11traces.Wefoundthatreducingthesizeofthecache(below4KbytesforthePDP-11)increasedthemissratioandthebustraffic-ingeneralthetwocorrelatewellwithrespecttothisparameter.Fig.1showstheaveragemissratioandbustrafficasafunctionoftotalcachesizeforthePDP-11traces.Forthisandallresultsgiven,themissratioincludeswrites.Thebustrafficisgivenasapercentofthenumberofaccessesthatwouldberequiredifnocachewerepresent.Fig.2showsthesamedatafortheVAXtraces.4.4.HIockSizeOurcachedesignincorporatesextremelysmallblocks,dependingheavilyontemporallocality.EastonandFaginEaston78claimthatpagesizeandmissratiosareindependentforwarmstart,buthighlydepen-dentforcoldstart.Iftrue,thiscanbeexplainedbytheobservationthathitsinthecacheoncoldstartdependheavilyonspatiallocality,whiletemporallocalitypro-videsmanyhitswhenitiswarm.Spatiallocality,how-ever,isstronglycorrelatedtoblocksize.beingdirectlyproportionalintheextremecaseofstrictlysequentialmemoryaccesses.OursimulationspartiallyconfirmEastonsobservation.Inparticular,wefoundthat,asblocksizeisincreased,missratiosgenerallydeclineuptoapoint,thenincreaseforeitherwarmorcoldstarts.However,forsmallblocksizes,thewarmstartmissratioismarginallylowerthanforthecoldstartcase,whileforlargeblocksizes,thetwonumbersarenearlyidentical.Seefigs.3-7.Thisisencouragingsincewehavearguedforrestrictedtaskswitches:ourenvironmentismorethatofawarmstartthanisthetraditionalenvironment.Inmanysimulationswewereabletogetveryhighhitratiosoncethecoldstartperiodended.Forsmallblockstransferred,however,thisperiod(thelifetimefunction)islonger.Oursimulationsshowveryclearlythatreducingtheblocksizedowntoasingletransferacrossthebusdramaticallydecreasesthehitratio,par-ticularlyforcoldstarts,butalsodecreasesbustrafficsignificantly.Ingeneral,weobservedthatincreasingthetransferblocksizefromonebuscycletotwotypicallydecreasesthemissratioby30to50,whileincreasingthebustrafficby10to20.Thisrelationholdsforthefirsttwodoublings.Theseresults,e.g.,fig.3,arerela-tivelymorepessimisticforsmallblocksizesthanthosereportedbyStreckerinBell78.Wehavemadetheassumptionthataccesstimeisrelatedlinearlytoblocksize.Inmanycasesthisisnottrue.ItisessentiallytruefortheMultibus,sinceonlytwobytescanbefetchedatatime,andarbitrationisoverlappedwithbusoperations.Forasingle-chipimple-mentation,itwouldalmostcertainlybeworthwhiletoprovidethecapabilityforefficientmultipletransfersoverasetofwiresintotheprocessor.Thishasnotbeenincorporatedinouranalysis,butwillundoubtedlysug-gestasomewhatlargertransferblocksize.4.4.1.LoweringtheOverheadofnallBlocksSmallblocksarecostlyinthattheygreatlyincreasetheoverheadofthecache:anaddresstagandthetwostatebitsarenormallystoredinthecacheforeachblocktransferred.Wereducedthisoverheadbysplittingthenotionofblockintotwoparts:(1)Thetransferblockistheamountofdatatransferredfrombackingstoreintothecacheonareadmiss.(2)Theaddressblockisthequantumofstoragefor.whichatagismaintainedinthecache.Itisalwaysapoweroftwolargerthanatransferblock.Aneffectivecachecanbeimplementedbykeepingthetransferblocksmallbutmakingtheaddressblocklarger.Formostcommercialproductscontainingacache,theaddressblocksizeisthesameasthetransferblocksize,thoughweknowofoneexampleIBM74wheretheaddressblockcontainedtwotransferblocks.TheIBMSystem/360Model85Liptay68infactisaspecialcaseofthis,v/z.,adirect-mappedcache,wheretheModel85sector,consistingof1Kbytes,correspondstoouraddressblock.Eachsectorcontains16transferblocks,whichwerecalledsimplyblocks.4.4.2.TheEffectofLargeAddressBlocksTheuseofaddressblockslargerthantransferblocksmeansthatonlydatafromoneaddressblockin128backingstorecanoccupyanyofthetlocksmakingupanaddressblockinthecache.Therearecaseswheretheappropriatetransferblockisempty,butothertransferblocksinthesameaddressblockmustbepurgedsothatthenewaddressblockcanbeallocated.Weexaminedthisforvarioussizesofaddressblocksandfoundthatthemissratioincreasedveryslowlyuptoapoint.Forthesituationshowninfig.7,themissratiohadonlyrisenbyabout30%whentheaddressblockcon-tained64bytesforthePDP-11traces.ThatpointwasreachedfortheVAXtraceswhenitcontained32-byteaddressblocks.Wepredictedthatthebustrafficwouldcorrelatewellwithmissratiowithrespecttothisparameter.Tooursurprise,thebustrafficactuallydeclinedinitiallyasweincreasedtheaddressblocksize.Thedeclinewassmall,butconsistentforthePDP-11tracetapes,eventu-allyclimbingoverthebaselinewhentheaddressblockwas16or32transferblocks.Thephenomenonwassmaller,butdiscerniblefortheVAXtracesaswell,thoughinallcasesthebustrafficstartedincreasingsooner.Thissituationisshowninfig.8.Weinitiallysuspectedthatoursimulationmightbefaulty.Thatwasnotthecase,andeventuallywewereabletoexplainitandverifyit.Thewrite-oncealgorithmrequiresabusoperationwheneverablockismodifiedinitially(settoreserved.)However,reservationscouldbemadeonthebasisofeithertransferblocksoraddressblocks.Wehadputthechoiceintothesimulator,buthadnotexperimentedwithit,reservingattheaddressblocklevel.Thisinfactreducesthenumberofbuswritesnecessaryforreservationbecauseofspatiallocal-ityofwrites:anaddressblockalreadyfeservedneedonlybemarkeddirtywhenanytransferblockwithinitismodified.Thiswouldseemtoincreasegreatlythetrafficwhenevertheblockispurgedfromthecache,butinfacttheeffectissmall:onlythosetransferblockswhichhaveactuallybeenmodifiedneedbewrittenback.Wedemonstratedthatthiswasindeedresponsibleforthebehaviornotedbychangingthereservationstothetransferblocklevel.Thesimulationresultsthenexhibitedtheoriginallypredictedbehavior,correlatingcloselywiththemissratio.Weconcludethatminimumbustrafficisgeneratedwithminimumtransferblocksizes.Themissratiomaybesubstantiallyimprovedbyusingslightlylargertransferblocks,inwhichcasebustrafficdoesnotincreasegreatly.Usinglargeraddressblocksreducesthecostofthetagmemoryconsiderably.Itinitiallyhasonlyaminoreffectonmissratio,whichismorethanoffsetbythesavingsinwritesduetothemoreefficientreservationofmodifiedblocks4.5.OtherDesignAspectsStudied4.5.1.WriteAllocationWriteallocation,alsoknownasfetchonmte,meansthatablockisallocatedinthecacheonawritemissaswellasonareadmiss.Whileitseemsnaturalforwrite-back,ittypicallyisnotusedwithwrite-through.Itisessentialforwrite-oncetoassurecoherency.Ourearlysimulationsshowedthatitwashighlydesirableforwrite-backandwrite-once,andsuperiorevenforwrite-throughwithsmallblocks.Thiswastrueusingboththemeasuresofmissratioandbustraffic.Inallresultspresented,writeallocationwasemployed.4.5.AssociaUvityWerananumberofsimulationsvaryingtheassocia-tivityallthewayfromdirectmappedtofullyassociative.Whilethisisclearlyanimportantparameter,wehavelit-tlenewtoreport,i.e.,fullyassociativecacheisthebest,but2-waysetassociativeisnotmuchworse,and4-waysetassociativeissomewhereinbetween.(ButseeSmith83).Wehadhopedtofindthatahighdegreeofassoeiativitywouldimproveperformance,becausesuchanorganizationismuchmorefeasibleintheVLSIdomain,butresultswerenegative.Forresultsreportedherewehaveassumeda4-way,setassociativecache.4.5.3.ReplacementAlgorithmReplacementstrategyhasbeenthesubjectofanotherstudyusingthesamesimulatorandtracesSmith83.Inordertolimititssignificance,whichseemstobeorthogonaltotheissuesraisedhere,wehaveassumedtrueLRUreplacementamongtheelementsofeachsetinallcases.4.5.4.BusWidthThewidthofthedatapathsbetweenunitsisanimportantparameterinthatitiscloselyrelatedtobandwidth.WehavethecapabilitytospecifythebuswidthbothfrombackingstoretocacheandfromcachetoCPU.Forthepurposesofthisstudy,wehaveassumedinallcasesthattheVAXmemorysupplies4bytestothecacheinonebuscycle,whilethePDP-11memorysup-plies2bytes.An8-bytetransferthereforeiscountedastwocyclesfortheVAXand4forthePDP-11.Weassumedthatthecachesuppliedoneword-16bitsforthePDP-11and32bitsfortheVAX-totheCPUoneachrequest.However,theintelligenceofthepro-cessordetermineshowoftenthesamewordmustbefetched.Thetracetapescontainallmemoryreferences.Wefilteredthesewiththeassumptionthatoninstructionfetchesthesamewordwouldnotbefetchedwithoutaninterveninginstructionfetch.Nofilteringwasdoneondatafetches,5.SummaryOursimulationsindicatethatasingleboardcom-puterwitha4K-bytecachecanperformreasonablywellwithlessthan10%oftheaccessesrequiredtoitspri-marymemorywithoutacache.ThePDP-11tracessug-gestanumberaslowas3%.WhiletheVAXnumbersarehigher,additionaldeclineswillbeexperiencedbyincreasingthesizeofthecachebeyond4Kbytes.Animportantresultistheuseofthewrite-oncealgorithmtoguaranteeconsistentdataamongmultipleprocessors.Wehaveshownthatthisalgorithmcanbeimplementedinawaythatdegradesperformanceonlytrivially(ignoringactualcollisions,whicharerare),andperformsbetterthaneitherpurewrite-backorwrite-throughinmanyinstances.Theuseofsmalltransferblocksizescanbecoupledwithlargeaddressblockstobuildaninexpensivecachewhichperformseffectivelyintheabsenceoffrequentprocessswitches.Thelowbusutilizationandthesolu-tiontothestaledataproblemmakepossibleanenviron-mentforwhichthisconditionismet.Eventhoughthemissratioincreases,bustrafficinitiallydeclinesastheaddressblockisenlarged,holdingthetransferblockconstant.Thereforelargeraddressblocksshouldbeusedforreservingmemoryformodificationevenifsmallblocksareusedfortransferofdata.Theapproachadvocatedhereisappropriateonlyforasystemcontainingasinglelogicalmemory.Thisis.significantbecauseitdependsontheserializationofmemoryaccessestoassureconsistency.Ithasapplica-tionsbeyondthosestudiedhere,however.Forexample,theaccesspathtomemorycouldbeviaaringnetwork,oranyothertechniqueinwhicheveryrequestpasseseveryprocessor.Thisextensionseemsparticularlyapplicableattheformaintainingconsistencyforafilesystemoracommonvirtualmemorybeingsuppliedto129multipleprocessorsthroughacommonbussuchasEth-ernet.Clearlytherearemanyenvironmentsforwhichthismodelisinappropriate-responsetoindividualtasksmaybeunpredictable,forexample.However,webelievethatsuchaconfigurationhasmanypotentialapplica-tionsandcanheexploitedeconomicallyiftheappropri-ateVLSIcomponentsaredesigned.Wehaveinvestigatedthedesignofsuchcomponentsandbelievethattheyarebothfeasibleandwell-suitedforVLSIRavishankar83.OuranalysisindicatesthatthecacheapproachisreasonableforasystemwherebandwidthbetweentheCPUandmostofitsmemoryisseverelylimited.Wehavedemonstratedthroughsimulationofrealprogramsthatacachememorycanbeusedtosignificantlyreducetheamountofcommunicationaprocessorrequires.Whitewewereinterestedinthisforasingle-chipmicrocom-puterofthefuture,wehavealsodemonstratedthatsuchanapproachisfeasibleforoneormorecurrentlypopu-larcommercialmarkets.6.AcknowledgementsThismaterialisbaseduponworksupportedbytheNationalScienceFoundationunderGrantMCS-8202952.WethankDr.A.J.SmithforprovidingthePDP-11tracetapesuponwhichmuchofourearlyworkdepended.WealsowishtothankT.-H.Yangfordevelop-ingtheVAXtracefacility.P.VitaleandT.Doylecontri-butedmuchthroughdiscussionsandbycommentingonanearlydraftofthemanuscript.7.ReferencesAmdahl82C.Amdahl,privatecommuvticttiovt,March82.Bell74J.Bell,D.Casasent,andC.G.Bell,Aninvestiga-tionofalternativecacheorganizations,IEEETrans.onComputers,Vol.C-23,No.4,April1974,pp.348-351.Bell78C.BeU,J.Judge,J.McNamara,Computerengineeriy:aDECviezuofttardaresystemdesign,DigitalPress,Bedford,Mass.,1978.Censier78L.M.CensierandP.Feautrier,Anewsolu

人人文库> 全部分类> 专业文献 > 工程机械

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf

文档简介

温馨提示

最新文档

评论

32-Using Cache Memory to Reduce Processor-Memory Traffic.pdf

文档简介

温馨提示

最新文档

评论

相关文档