已阅读5页,还剩3页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
USINGCACHEMEMORYTOREDUCEPROCESSOR-MEMORYTRAFFICJamesR.GoodmanDepartmentofComputerSciencesUniversityofWisconsin-MadisonMadison,WI58706ABSTRACT-Theimportanceofreducingprocessor-memorybandwidthisrecognizedintwodistinctsitua-tions:singleboardcomputersystemsandmicroproces-sorsofthefuture.Cachememoryisinvestigatedasawaytoreducethememory-processortraffic.Weshowthattraditionalcacheswhichdependheavilyonspatiallocality(look-ahead)fortheirperformanceareinap-propriateintheseenvironmentsbecausetheygeneratelargeburstsofbustraffic.Acacheexploitingprimarilytemporallocality(look-behind)isthenproposedanddemonstratedtobeeffectiveinanenvironmentwhereprocessswitchesareinfrequent.Wearguethatsuchanenvironmentispossibleifthetraffictobackingstoreissmallenoughthatmanyprocessorscanshareacommonmemoryandifthecachedataconsistencyproblemissolved.Wedemonstratethatsuchacachecanindeedreducetraffictomemorygreatly,andintroducee.relegantsolutiontothecachecoherencyproblem.1.IntroductionBecausetherearestraightforwardwaystocon-structpowerful,cost-effectivesystemsusingrandomaccessmemoriesandsingle-chipmicroprocessors,sem-iconductortechnologyhas,untilnow,hadthegreatestimpactthroughthesecomponents.High-performanceprocessors,however,arestillbeyondthecapabilityofasingle-chipimplementationandarenoteasilyparti-tionedinawaywhichcaneffectivelyexploitthetechnol-ogyandeconomiesofVLSI.Aninterestingphenomenonhasoccurredinthepreviousdecadeasaresultofthisdisparity.Memorycostshavedroppedradicallyandcon-sistentlyforcomputersystemsofallsizes.WhilethecomponentcostofaCPU(single-chipimplementationsexcluded)hasdeclinedsignificantlyoverthesameperiod,thereductionhasbeenlessdramatic.Aresultisthattheamountofmemorythoughttoheappropriateforagivenspeedprocessorhasgrowndramaticallyinrecentyears.Todaysmallminicomputershavememoryaslargeasthatofthemostexpensivemachinesofadecadeago.Permissiontocopywithoutfeeallorpartofthismaterialisgrantedprovidedthatthecopiesarenotmadeordistributedfordirectcommercialadvantage,theACMcopyrightnoticeandthetitleofthepublicationanditsdateappear,andnoticeisgiventhatcopyingisbypermissionoftheAssociationforComputingMachinery.Tocopyotherwise,ortorepublish,requiresafeeand/orspecificpermission.TheimpactofVLSIhasbeenverydifferentinmicroprocessorapplications.Herememoryisstillregardedasanexpensivecomponentinthesystem,andthosefamiliarprimarilywithaminicomputerormain-frameenvironmentareoftenscornfulofthetroubletowhichmicroprocessorusersgotoconservememory.Thereason,ofcourse,isthateventhesmallmemoryinamicroprocessorisamuchlargerportionofthetotalsys-temcostthanthemuchlargermemoryonatypicalmainframesystem.Thisresultsfromthefactthatmemoryandprocessorsareimplementedinthesametechnology.1.1.ASuperCPUWiththeadvancestoVLSIoccurringnowandcon-tinuingoverthenextfewyears,itwillbecomepossibletofabricatecircuitsthatareonetotwoordersofmagni-tudemorecomplexthancurrentlyavailablemicropro-cessors.Itwillsoonbepossibletofabricateanextremelyhigh-performanceCPUonasinglechip,IftheentirechipisdevotedtotheCPU,however,itisnotagoodidea.Extrapolatinghistoricaltrendstopredictfuturecomponentdensities,wemightexpectthatwithinafewyearsweshouldbeabletopurchaseasingle-chipprocessorcontainingatleasttentimesasmanytransis-torsasoccurin,say,theMC68000.FortheempiricalruleknownasGroschslawGrosch53,P=kCg,wherePissomemeasureofperformance,Cisthecost,andkandgareconstants,KnightKnight66concludedthatgisatleast2,andSolomonSolomon66hassuggestedthatg1.47.FortheIBMSystem/B70family,Siewiorekdeterminedthatgal.8Siewiorek82.WhileGroschslawbreaksdowninthecomparisonofprocessorsusingdifferenttechnologyorarchitectures,itisrealisticforpredictingimprovementswithinasingletechnology.Siewiorekinfactsuggeststhatitholdsbydefinition.Assumingg=1.5andusingprocessor-memorybandwidthasourmeasureofperformance,Groschslawpredictsthataprocessorcontainingi0timesasmanytransistorsasacurrentmicroprocessorwouldrequire30timesthememorybandwidth.*TheMotorolaMC68000,runningat10MHz,accessesdatafrommemoryatamaximumrateof5millionbytespersecond,usingmorethanhalfitspinstoachievethisrate.Althoughpackag-ingtechnologyisrapidlyincreasingthepinsavailabletoachip,itisunlikelythattheincreasewillhe30-fold(the68000has84pins).Wewouldsuggestafactoroftwoisrealistic.Althoughsometechniquesareclearlypossibletoincreasethetransferrateintoandoutofthe68000,supplyingsuchaprocessorwithdataasfastasneededisasevereconstraint.Oneofthedesignersofthe88000,hasstatedthatallmodernmicroprocessors-the680001Thisisaconservativeestimate,infact,becauseitignorespredictabledecreasesinatedelays.1983ACM0149-7111/83/0600/0124501.00124included-arealreadybus-limitedTredennickB2.1.2.On.-chipMemoryOnealternativeforincreasedperformancewithoutproportionatelyincreasingprocessor-memorybandwidthistointroducememoryonthesamechipwiththeCPU.Withtheabilitytofabricatechipscontainingonetotwomilliontransistors,itshouldbepossible-usingonlyaportionofthechip-tobuildaprocessorsignificantlymorepowerfulthananycurrentlyavailablesingle-chipCPU.WhiledevotingtheentirechiptotheCPUcouldresultinastillmorepowerfulprocessor,introducingon-chipmemoryoffersareductioninmemoryaccesstimeduetotheinherentlysmallerdelaysascomparedtointer-chipdatatransfers.Ifmostaccesseswereon-chip,itmightactuallyperformasfastasthemorepowerfulprocessor.Ideally,thechipshouldcontainasmuchmemoryastheprocessorneedsformainstorage.Conventionalwisdomtodaysaysthataprocessorofthespeedofcurrentmicroprocessorsneedsatleast1/megabytesofmemoryLindsayB1.Thisiscertainlymorethanisfeasibleon-chip,thoughahighperformanceprocessorcouldprobablyusesubstantiallymorethanthat.ClearlyalltheprimarymemoryfortheprocessorcannotbeplacedonthesamechipwithapowerfulCPU.Whatisneededisthetopelementofamemoryhierarchy,1.3.CacheMemoryTheuseofcachememory,however,hasoftenaggra-vatedthebandwidthproblemratherthanreduceit.SmithSmithB2saysthatoptimizingthedesignhasfourgeneralaspects:(1)maximizingthehitratio,(2)minimizingtheaccesstimetodatainthecache,(3)minimizingthedelayduetoamiss,and(4)minimizingtheoverheadsofupdatingmainmemory,maintainingmulticacheconsistency,etc.Theresultisoftenalargerburstbandwidthrequirementfrommainstoragetothecachethanwouldbenecessarywithoutacache.Forexample,thecacheontheIBMSys-tem/370model16B,iscapableofreceivingdatafrommainmemoryatarateof100megabytespersecondIBM76,ItsuppliesdatatotheCPUatlessthan1/3thatrate.Thereasonisthattoexploitthespatiallocal-ityinmemoryreferences,thedatatransferredfrombackingstoreintothecacheisfetchedinlargeblocks,resultinginrequirementsofveryhighbandwidthburstsofdata.WehavemeasuredtheaveragebandwidthonanIBMSystem/370model155,andconcludedthattheflyer-agebacking-store-to-cachetrafficislessthanthecaehe-to-CPUtraffic.Thedesignofcachememoryformini-computersdemandedgreaterconcernforbusbandwidth.ThedesignersofthePDP-11models80and70clearlyrecog-nizedthatsmallblocksizeswerenecessarytokeepmainmemorytraffictoaminimumBell78.Loweringthebandwidthfrombackingstoretothecachecanbeaccomplishedinoneoftwoways:(1)smallblocksofdataarebroughtfrombackingstoretothecache,or(2)longdelaysoccurwhileablockisbeingbroughtin,independentof(andinadditionto)theac-cesstimeofthebackingstore.Whileitispossibletobringinthewordrequestedini-tially(readthrough),thusreducingthewaitonagivenreference,thelowbandwidthmemoryinterfacewillremainbusylongaftertheinitialtransferiscompleted,resultinginlongdelaysifasecondbackingstorageoperationisrequired.Wethereforehaveexploredtheeffectivenessofacachewhichexploitsprimarilyorexclusivelytemporallocality,i.e.,theblocksfetchedfrombackingstoreareonlythesizeneededbytheCPU(orpossiblyslightlylarger).Inconsideringwaystoevaluatethisstrategy,weidentifiedacommercialenvironmentthatcontainedmanyofthesameconstraintsandseemedamenabletothesamekindsofsolutions.Thisenvironmentisthemarketplaceofthesingle-boardcomputerrunningonastandardbussuchasMultibusorVersabus.sWehavechosentostudythisenvironmentinanattempttogaininsightintotheoriginal,generalscheme.2.TheSingleBoardComputerApplicationAsingleboardcomputertypicallycontainsamicroprocessorandasubstantialamountofmemory,thoughsmallenoughthatitmustbeusedcarefully.Ifneeded,accesstoadditionalrandomaccessmemoryisthroughthebus,whichisdesignedforgeneralityandsimplicity,notforhighperformance.Multibus,inpartic-ular,wasdefinedintheearly70stoofferaninexpensivemeansofcommunicationamongavarietyofsub-systems.AlthoughoriginallyintroducedbyIntelCor-poration,ithasfoundwideacceptance,havingbeenpro-posed-inaslightlymodifiedform-astheIEEEP796busstandard/EEEB0.Currently,severalhundredven-dorsofferMultibus-compatiblecards.Whilethemarkethasrapidlydevelopedforproductsusingthisbus,itsapplicationsarelimitedbythesevereconstraintimposedbythebandwidthofMultibus.Clearlythebusbandwidthcouldbeincreasedbyincreas-ingthenumberofpins,andbymodifyingtheprotocol.Itsbroadpopularityandtheavailabilityofcomponentstoimplementitsprotocolmean,however,thatitislikelytosurvivemanyyearsinitspresentform.Thusalargemarketexistsforacomputer-on-a-cardwhich,muchasifitwereallonasinglechip.hasseverelimitationsonitscommunicationswiththerestofthesystem.WedecidedtodetermineifacachememorysystemcouldbeimplementedeffectivelyintheMultibusenvironment.Tothatendwehavedesignedacachetobeusedwithacurrent-generationmicroprocessor.Inaddition,wehavedoneextensivesimulationofcacheperformance,drivenbymemorytracedata.WehaveidentifiedanewcomponentwhichisparticularlysuitedforVLSIimplementationandhavedemonstrateditsfeasibilitybydesigningitRavishankarB3.Thiscom-ponent,whichimplementsthetagmemoryforadynamicRAMcacheintendedforamicroprocessor,issimilarinmanyrespectstotherecentlyannouncedTMS2150Tree.Multibussystemshavegenerallydealtwiththeprob-lemoflimitedbusbandwidthbyremovingmostoftheprocessor-memoryaccessesfromthebus.Eachproces-sorcardhasitsownlocalmemory,whichmaybeaddressabletoothersthroughtheMultibus.Whilethisapproachhasmuchincommonwithours,webelievethattheallocationofmemory-localorremote-shouldbehandledbythesystem,freeingtheprogrammerofthistask.IntypicalMultibusapplications,considerableeffortisexpendedguaranteeingthattheprogramrun-ningisprimarilyresidenton-board.Thisapproachisviableforastaticpartitioningoftasks.Resultstodatehavebeenmuchlesssatisfactory,however,forthemoregeneralsituationwhereanumberofprocessorsaredynamicallyallocated.(Forefficiencyreasonsitalsoprecludestheuseofsharedcodesegments).eMultibusisatrademarkofIntelCorporation.SVersabusisatrademarkofMotorola.125Inmanyenvironments,asimplp,dynamichardwareallocationschemecanefficientlydeterminewhatmemorylocationsarebeingaccessedfrequentlyandshouldthereforebekeptinlocalmemory-betterthantheprogrammerwhooftenhaslittleinsightintothedynamiccharacteristicsofhisprogram.Thereareenvironmentswheretheprogrammerisintimatelyfami-liarwiththebehaviorofhisprogramandcangeneratecodetotakeadvantageofit.Inthisenvironmentthetimespentrunningaprogramisoftenmuchmoresub-stantialthanthetimedevelopingtheprogram.Thisexplains,forexample,whyaninvisiblecacheisnotappropriateontheCRAY-1.Webelievethatfreeingtheprogrammerfromconcernaboutmemoryallocationisessentialwhereprogrammerproductivityiscritical.B.i.ASingle-BoardComputerwithCacheToevaluateourapproach,weproposedasingle-boardcomputercontaining,(possiblyalongwithotherthings)aCPUandnolocalmemoryexceptacache,withbackingstoreprovidedthroughMultibus.Thuswepickedanimportantprobleminitsownright:CanwebuildacachethatworkswithaMultibussystemsupport-ingmultipleprocessors?Inparticular,howmanypro-cessorscanwesupportrunninginparallelonMultibus?Webelievethatasystemwhichcouldreasonablysupportfiveto10processorswouldbeasignificantadvance.ThiscantbecompareddirectlyagainstcurrentsystemsbecauseasingleprocessoroverloadstheMultibus.Thuslocalmemoriesmustbeheavilyexploitedifperformanceisimportant.EarlieranalysesKaplan73,Bell74,Rao78,Pate182haveusedthecachehitratioorsomethingcloselyrelatedtomeasureperformance.Theimportantcri-terionhereistomaximizeuseofthebus,notthehitratio,orevennecessarilytooptimizeprocessorperfor-mance.Weoptimizesystemperformancebyoptimizingbusutilization,achievinghigherperformancebyminim-izingindividualprocessorsbusrequirements,andtherebysupportingmoreprocessorsreasonablywell.Weallowindividualprocessorstositidleperiodicallyratherthantieupthebusfetchingdatawhichtheymightnotuse.Thisimpliesthatthecachestaledataproblemmustbesolvedeffectively.Wepresentanewsolutioninsection3.2.2.SwitchingContextsWherebusbandwidthislimited,ataskswitchisamajordisturbance,sincethecachemusteffectivelybereloadedatthistime.Theprocessorismomentarilyreducedtoaccessesattherateatwhichthebuscansupplythem.Whilethisproblemseemsunavoidable,itneednotbeseriousiftaskswitchingisminimized.Weareprovidinganenvironmentwhichallowsmanyproces-sorstoworkoutofasinglemonolithicmemoryinparal-lel.Ifmoreparalleltasksarerequired,moreprocessorscanbeused.WepointoutthatthecurrentMultibusalternativeistomovetheprogramintolocalmemory,anoperationwhichalsoswampsthebus.Thetaskswitchmerelymakesthisoperationimplicit,andavoidsbring-ingacrossthebusdatawhichareneveractuallyused.Writingtheolddataoutisalsonoworsethanthealterna-tive,sinceweonlywritethatwhichhasbeenchangedandwhichhasnotbeenalreadypurged.Theremaybecertaincases-aninterrupthandlingprogram,forexample-whereaparticularprogramdoesnotflushthecache,butusesonlyasmallportionofit.Provisionscouldbemadetoallowsuchaprogramtobelockedinthecache.Alternatively,aseparatecachemightbeprovidedforsuchaprogramOurstudiesindi-catethatarelativelysmallcachecanbeeffectiveforasingleprogram,soitmaybepossibletokeepseparatecachesaroundforindividualprocessesifthenumberissmall.Wewouldsu&g;:,;,tmgthisonestepfurtherandprovidinganadditionalprocessorforeachcache.Aninterestingquestionthenarisesastothecostofdynami-callyassigningprocessestoprocessors.Ourproposalallowsthisassignment,thoughclearlyatsomeperfor-mancepenalty.3.CacheCoherencyItiswell-knownthatmultiplecachespresentseriousproblemsbecauseoftheredundancyofstorageofasin-glelogicalmemorylocationTang76,Censier78,Rao78.Themostcommonmethodamongcommercialproductsfordealingwiththis,thestaledataproblem,istocreateaspecial,high-speedbusonwhlehaddressesaresentwheneverawriteoperationisperformedbyanyproces-sor.ThissolutionhasweaknessesCeLisier78whichhavegenerallylimitedcommercialimplementationstotwoprocessors.Inthesingle-chipprocessororsingle-boardcomputerenvironments,ithastheaddedweak-nessthatitrequiresanumberofextraI/0pins.Analternativeapproach,implementedinC,mmpHoogendoorn77andproposedbyNortonNorton82.istorequiretheoperatingsystemtorecognizewhenincon-sistenciesmightoccurandtakestepstopreventthemunderthosecircumstances.Thissolutionisunappealingbecausethecacheisnormallyregardedasanarchitecture-independentfeature,invisibletothesoftware.Athirdapproach,variationsofwhichhavebeenpro-posedbyCensierandFeautrierCensier78,TangTang76,WiddoesWiddoes79,andYenandFuYen82,istousesomeformoftaggedmainmemory,keepingtrackofindividualblocksinthiswaytopreventincon-sistency.Individualblocksaretemporarilydesignatedasprivateforaparticularprocessorsothatitmaymodifyitrepeatedlywithoutreferencetomainmemory.Thetagmustbesetwheneversuchacriticalsectionisenteredandresetwheneverthecriticalsectionisleft,i.e.,themodifiedwordiswrittenbacktomainstorage.Thisapproachrequiressubstantialhardware,andappearsinfeasibleforalargenumberofcaches,sinceanoperationinacentralplaceisrequiredattheentryorexitofanycriticalsection.Ourapproachhasmuchincommonwiththethirdapproach,butallowsthecriticalsectioninformationtobedistributedamongthecaches,whereitalreadyresides.Inaddition,weusethenormalreadandwriteoperations,withnotagbitsinmainmemory,toaccom-plishthesynchronization.ArelatedschemeArndah182whichusesaspecialbustoconveythenoticeofentryorexitfromacriticalsection,hasbeenimplementedinacommercialproduct,buthasnotbeenpublishedtoourknowledge.Wecallourschemewr/te-once.3.1.Write-ThroughorWrite-Back?WhilethechoicebetweenwriLe-through(alsoknownasstore-through)andwrite-back(alsoknownasstore-backorcopy-back)hasnobearingonthereadhitratio,ithasamajorimpactonbustraffic,particularlyasthehitratioapproaches100%.Inthelimit,whenthehitratioisi00%.write-backresultsinnobustrafficatall.whilewrite-throughrequiresatleastonebuscycleforeachwriteoperation.NortonNorton82concludedthatusingwrite-backinsteadofwrite-throughforahypothet-icalprocessortypicallywouldreducethebustrafficbymorethan50%andiftheprocessesrantocompletionbustrafficwouldbedecreasedbyafactorof8.Fortypi-calread-to-writeandhitratiosandwhentaskswitchingisinfrequent,oursimulationshavegivenstrongevidencethatwrite-backgeneratessubstantiallylessbustrafficthanwrite-through.126Butwrite-backhasmoreseverecoherencyprob-lemsthanwrite-through,sinceevenmainmemorydoesnotalwayscontainthecurrentversionofaparticularmemorylocation.3.2.ANewWriteStrategy:Write-OnceWeproposeanewwritestrategywhichsolvesthestaledataproblemandproducesminimalbustraffic.Thereplacementtechniquerequiresthefollowingstruc-ture.Associatedwitheachblockinthecachearetwobitsdefiningoneoffourstatesfortheassociateddata:InvalidThereisnodataintheblock.VainThereisdataintheblockwhichhasbeenreadfrombackingstoreandhasnotbeenmodified.ReservedThedataintheblockhasbeenlocallymodifiedexactlyoncesinceitwasbroughtintothecacheandthechangehasbeentransmittedtobackingstore.D/tryThedataintheblockhasbeenlocallymodifiedmorethanoncesinceitwasbroughtintothecacheandthelatestchangehasnotbeentransmittedtobackingstore.Write-oncerequiresrapidaccesstotheaddresstagsandstatebitpairsconcurrentlywithaccessestotheaddresstagsbytheCPU.Thiscanmosteasilybeachievedbycreatingtwo(identical)copiesofthetagmemory.CensierCensier78claimsthatduplicationistheusualwayoutforresolvingcollisionsbetweencacheinvalidationrequestsandnormalcacherefer-ences.Thisisnotalargecost,sinceasinglechipdesignofthispartofthecache-usingpresenttechnology-isquitefeasible.Further,wehavediscoveredawaytoreducesubstantiallythenumberoftagsrequired.Inaddition,thesamechiptypecouldbeusedforbothinstances.ThisisanaturalwaytopartitionthecacheinVLSIbecauseitresultsinamaximallogic-to-pinratio.WehavedesignedandsubmittedforfabricationsuchachipRavisha_nkar83.Thetwocopiesalwayscontainexactlythesameaddressdata,becausetheyarealwayswrittensimul-taneously.WhileoneunitisusedintheconventionalwaytosupportaccessesbytheCPU,asecondmonitorsallaccessestomemoryviatheMultibus.Foreachsuchoperation,itchecksfortheaddressinthelocalcache.Ifamatchisfoundonawriteoperation,itnottiesthecachecontroller,andtheappropriateblockinthecacheismarkedinvalid.Ifamatchisfoundonareadopera-tion,nothingisdoneunlesstheblockhasbeenmodified,i.e.,itsstateisreservedordirty.Ifitisjustreserved,thestateischangedtovalid.Ifitisdirty,thelocalsys-temsinhibitsthebackingstorefromsupplyingthedata.Itthensuppliesthedataitself.4Onthesamebusaccessorimmediatelyfollowingit,thedatamustbewrittentobackingstore.Inaddition,foreitherreservedordirtydata,thestateischangedtovalid.Thisschemeachievescoherencyinthefollowingway.Initiallywrite-throughisemployed.However,anadditionalgoalisachieveduponwriting.Allothercachesarepurgedoftheblockbeingwritten,sothecachewrit-ingthroughthebusnowisguaranteedtheonlycopyexceptforbackingstore.Itissoidentifiedbybeingmarkedreserved.Ifitispurgedatthispoint,nowriteisnecessarytobackingstore,sothisisessentiallywrite-through.fanotherwriteoccurs,theblockismarkeddirty.Nowwrite-backisemployedand,onpurging,thedatamustberewrittentobackingstore.4ThereisamechanisminMultibuswhichallowsthiscapability.Unfortunately,itisrarelyused,notwell-defined,andrequiresthatocalcachesrespondveryrapidly.Versabushasamuchcleanermechanismbywhichthisendcanbeaccomplished.Write-oncehasthedesirablefeaturethatunitsaccessingbackingstoreneednothaveacache,andneednotknowwhetherothersdoornot.Acacheisresponsi-bleformaintainingconsistencyexactlyforthosecaseswhereitmightcreateaviolation,i.e.,wheneveritwritestoalocation.Thusitispossibletomixinanarbitrarywaysystemswhichemployacacheandthosewhichdonot;thelatterwouldprobablybeI/0devices.Consider-ablecaremustbeexercised,however,whenawriteoperationoverthebusmodifieslessthananentireblock.4.SimulationWedesignedacachememorysystemtoworkonMultibus.Tovalidateourdesignbeforebuildingitwedidextensivesimulationusingmemorytracedata.Todatewehaveperformedextensivesimulationsforsixtraces,allrunningunderUNIX:EDCTheUNIXeditoredrunningascript.ROFFASTheoldUNIXtextprocessorprogramroff.TRACETheprogram,writteninassemblylanguage,whichgeneratedtheabovetracesforthePDP-11.NROFFTheprogramtroFinterpretingtheBerkeleymacropackage-me.CACHEThetrace-drivencachesimulatorpro-gram.COMPACTAprogramusinganon-linealgorithmwhichcompressesfilesusinganadaptiveHuffmancode.ThefirstthreetracesareforaPDP-11,whilethelatterthreeareforaVAX.WhilethePDP-11doesnotrunonMultibus,itsinstructionsetissimilartomanymicropro-cessorswhichdo,andtheprogramsusedfortracingwereofthekindweenvisionforsuchasystem.ThePDP-11issimilarinmanywaystotheMC58000,andhasincommonwiththe8086alimitedaddressingcapability.WhiletheVAXalsodoesnotrunonMultibus,itisanexampleofamoderninstructionsetand,thereforeisareasonableexampleofthekindofprocessorlikelytoappearinasingle-chipCPUinthefuture.Italsohasalargeraddressspacewhich,asshowninsection4.3,issignificant.Weareactuallyusingvirtualaddresses,butalloftheprogramsweranaresmallenoughtofitintomainmemory.Sincewearetracingonlyasinglepro-cess,weconcludethatthereisnosignificantdifferencebetweenvirtualandrealaddresses.Inadditiontocacheparameters,missratiosvarygreatlydependingontheprogramrunning.Fortheeachoftheabovetraces,awideandunpredictablevariationoccurredaswevariedasingleparameter.Thusplottingparametersfortheindividualtraceswasoftennotenlightening.Averagingoverthethreetracesineachcategorygavemuchmorerevealingresults,providingdatathatsuggestedacontinuousfunctionformanyofthevariablesstudied.Thusallourresultsareactuallytheaverageofthreeprograms,eachrunningalone.4.1.EffectofWriteStrategyonBusTrallieAitheughwrite-throughnormallygenerateslessbustrafficthanwrite-back,thelattercanbeworseifthehitratioislowandtheblocksizeislarge.Underwrite-back,whenadirtyblockispurged,theentireblockmustbewrittenout.Withwrite-through,onlythatpor-tionwhichwasmodifiedmustbewritten.Wefoundthatwrite-backisdecisivelysuperiortowrite-throughexcept(1)whencacheblocksareverylarge,or(2)whenthecachesizeisverysmall.SUNIXandNROFFaretrademarksofBellLaboratories.127Write-onceresultsinbustrafficroughlyequaltothebetterofthetwo.Wehavefoundcacheparametersforwhichitactuallyperformsbetterontheaveragethaneitherwrite-throughorwrite-backforanumberofpro-grams.Thiswasasurprisingresult,sincewrite-oncewasdevelopedtoassurecoherency,nottominimizebustraffic.Thereplacementschemeoutperformsbothwrite-throughandwrite-backwheneverthetotalnumberofsetsisabout16.Forexample,fora4-waysetassocia-tive,2048-bytecachewithablocksizeof32bytes,theaveragebustrafficforthreePDP-11programsforwhichwehavetraceswas30.768Zforwrite-through,17.55Zforwrite-back,and17.387oforwrite-once,4.2.ColdStartvs.WarmStartAnimportantconsiderationindeterminingcachehitratioandbustrafficisthecoldstartperiodknownasthelifetimefunctionEaston78,duringwhichtimemanymissesoccurbecausethecacheisempty.Thisisdefinedastheperioduntilasmanycachemisseshaveoccurredasthetotalnumberofblocksinthecache.Thisinitialburstofmissesisamortizedoverallaccesses,sothelongerthetraceanalyzed,thelowerthemissratioobtained.Inadditiontotheinitiationofaprogramandoccasionalswitchesofenvironments,acoldstartgen-erallyoccurswheneverthereisataskswitch.Thusanimportantassumptionintraditionalcacheevaluationisthefrequencyoftaskswitching.Wehavearguedthattaskswitchingmustbeveryinfrequentinoursystem.Thuswecanmorenearlyapproachinpracticethewarmstarthitratios,andthusitisappropriatetouseverylongtracesofasingleprogram,andassumeawarmstart.Wedidthatinitially,usingthefulllengthofthePDP-11tracesavailabletous(1,256,570memoryaccesses).Wenoted,however,thatforevenmuchshortertracesthanwewererunning,therewaslittledifferencebetweenwarmstartandcoldstartstatistics.Sincecoldstartstatisticsareeasiertogenerate,wenor-mallyusedthem.Unlessstatedotherwise,ourresultsarefromcoldstart,butatleast10timesthelifetimefunctionintotallength.4.3.CacheSizeIngeneral,weweresurprisedattheeffectivenessofasmallcache.ForthePDP-11traceswithacacheof2Kbytesorlarger,wediscoveredthatessentiallynomissesoccurredafterthecoldstartperiod.Thesearenottrivialprograms,butwererunonamachinewhichhasonly64Kbytesforbothinstructionsanddata.Thepro-gramsareveryfrugalintheiruseofmemory,andtheentireworkingsetapparentlycanfitinthecache.TheVAXtracesdonotexhibitthesamelocalityobservedwiththePDP-11,anda64K-bytecachewasnotlargeenoughtocontaintheentireworkingsetoftheprogram.Thismaybearesultofthelargeraddressspaceavailable,themorecomplexinstructionset,ormorecomplexprograms.InallcasestheprogramswerespreadoutoveramuchlargermemoryspacethanforthePDP-11traces.Forthiscomparisonweusedasmallblocksizeof4bytes.ThismayhavehadagreaterimpactontheVAXthanonPDP-11traces.Wefoundthatreducingthesizeofthecache(below4KbytesforthePDP-11)increasedthemissratioandthebustraffic-ingeneralthetwocorrelatewellwithrespecttothisparameter.Fig.1showstheaveragemissratioandbustrafficasafunctionoftotalcachesizeforthePDP-11traces.Forthisandallresultsgiven,themissratioincludeswrites.Thebustrafficisgivenasapercentofthenumberofaccessesthatwouldberequiredifnocachewerepresent.Fig.2showsthesamedatafortheVAXtraces.4.4.HIockSizeOurcachedesignincorporatesextremelysmallblocks,dependingheavilyontemporallocality.EastonandFaginEaston78claimthatpagesizeandmissratiosareindependentforwarmstart,buthighlydepen-dentforcoldstart.Iftrue,thiscanbeexplainedbytheobservationthathitsinthecacheoncoldstartdependheavilyonspatiallocality,whiletemporallocalitypro-videsmanyhitswhenitiswarm.Spatiallocality,how-ever,isstronglycorrelatedtoblocksize.beingdirectlyproportionalintheextremecaseofstrictlysequentialmemoryaccesses.OursimulationspartiallyconfirmEastonsobservation.Inparticular,wefoundthat,asblocksizeisincreased,missratiosgenerallydeclineuptoapoint,thenincreaseforeitherwarmorcoldstarts.However,forsmallblocksizes,thewarmstartmissratioismarginallylowerthanforthecoldstartcase,whileforlargeblocksizes,thetwonumbersarenearlyidentical.Seefigs.3-7.Thisisencouragingsincewehavearguedforrestrictedtaskswitches:ourenvironmentismorethatofawarmstartthanisthetraditionalenvironment.Inmanysimulationswewereabletogetveryhighhitratiosoncethecoldstartperiodended.Forsmallblockstransferred,however,thisperiod(thelifetimefunction)islonger.Oursimulationsshowveryclearlythatreducingtheblocksizedowntoasingletransferacrossthebusdramaticallydecreasesthehitratio,par-ticularlyforcoldstarts,butalsodecreasesbustrafficsignificantly.Ingeneral,weobservedthatincreasingthetransferblocksizefromonebuscycletotwotypicallydecreasesthemissratioby30to50,whileincreasingthebustrafficby10to20.Thisrelationholdsforthefirsttwodoublings.Theseresults,e.g.,fig.3,arerela-tivelymorepessimisticforsmallblocksizesthanthosereportedbyStreckerinBell78.Wehavemadetheassumptionthataccesstimeisrelatedlinearlytoblocksize.Inmanycasesthisisnottrue.ItisessentiallytruefortheMultibus,sinceonlytwobytescanbefetchedatatime,andarbitrationisoverlappedwithbusoperations.Forasingle-chipimple-mentation,itwouldalmostcertainlybeworthwhiletoprovidethecapabilityforefficientmultipletransfersoverasetofwiresintotheprocessor.Thishasnotbeenincorporatedinouranalysis,butwillundoubtedlysug-gestasomewhatlargertransferblocksize.4.4.1.LoweringtheOverheadofnallBlocksSmallblocksarecostlyinthattheygreatlyincreasetheoverheadofthecache:anaddresstagandthetwostatebitsarenormallystoredinthecacheforeachblocktransferred.Wereducedthisoverheadbysplittingthenotionofblockintotwoparts:(1)Thetransferblockistheamountofdatatransferredfrombackingstoreintothecacheonareadmiss.(2)Theaddressblockisthequantumofstoragefor.whichatagismaintainedinthecache.Itisalwaysapoweroftwolargerthanatransferblock.Aneffectivecachecanbeimplementedbykeepingthetransferblocksmallbutmakingtheaddressblocklarger.Formostcommercialproductscontainingacache,theaddressblocksizeisthesameasthetransferblocksize,thoughweknowofoneexampleIBM74wheretheaddressblockcontainedtwotransferblocks.TheIBMSystem/360Model85Liptay68infactisaspecialcaseofthis,v/z.,adirect-mappedcache,wheretheModel85sector,consistingof1Kbytes,correspondstoouraddressblock.Eachsectorcontains16transferblocks,whichwerecalledsimplyblocks.4.4.2.TheEffectofLargeAddressBlocksTheuseofaddressblockslargerthantransferblocksmeansthatonlydatafromoneaddressblockin128backingstorecanoccupyanyofthetlocksmakingupanaddressblockinthecache.Therearecaseswheretheappropriatetransferblockisempty,butothertransferblocksinthesameaddressblockmustbepurgedsothatthenewaddressblockcanbeallocated.Weexaminedthisforvarioussizesofaddressblocksandfoundthatthemissratioincreasedveryslowlyuptoapoint.Forthesituationshowninfig.7,themissratiohadonlyrisenbyabout30%whentheaddressblockcon-tained64bytesforthePDP-11traces.ThatpointwasreachedfortheVAXtraceswhenitcontained32-byteaddressblocks.Wepredictedthatthebustrafficwouldcorrelatewellwithmissratiowithrespecttothisparameter.Tooursurprise,thebustrafficactuallydeclinedinitiallyasweincreasedtheaddressblocksize.Thedeclinewassmall,butconsistentforthePDP-11tracetapes,eventu-allyclimbingoverthebaselinewhentheaddressblockwas16or32transferblocks.Thephenomenonwassmaller,butdiscerniblefortheVAXtracesaswell,thoughinallcasesthebustrafficstartedincreasingsooner.Thissituationisshowninfig.8.Weinitiallysuspectedthatoursimulationmightbefaulty.Thatwasnotthecase,andeventuallywewereabletoexplainitandverifyit.Thewrite-oncealgorithmrequiresabusoperationwheneverablockismodifiedinitially(settoreserved.)However,reservationscouldbemadeonthebasisofeithertransferblocksoraddressblocks.Wehadputthechoiceintothesimulator,buthadnotexperimentedwithit,reservingattheaddressblocklevel.Thisinfactreducesthenumberofbuswritesnecessaryforreservationbecauseofspatiallocal-ityofwrites:anaddressblockalreadyfeservedneedonlybemarkeddirtywhenanytransferblockwithinitismodified.Thiswouldseemtoincreasegreatlythetrafficwhenevertheblockispurgedfromthecache,butinfacttheeffectissmall:onlythosetransferblockswhichhaveactuallybeenmodifiedneedbewrittenback.Wedemonstratedthatthiswasindeedresponsibleforthebehaviornotedbychangingthereservationstothetransferblocklevel.Thesimulationresultsthenexhibitedtheoriginallypredictedbehavior,correlatingcloselywiththemissratio.Weconcludethatminimumbustrafficisgeneratedwithminimumtransferblocksizes.Themissratiomaybesubstantiallyimprovedbyusingslightlylargertransferblocks,inwhichcasebustrafficdoesnotincreasegreatly.Usinglargeraddressblocksreducesthecostofthetagmemoryconsiderably.Itinitiallyhasonlyaminoreffectonmissratio,whichismorethanoffsetbythesavingsinwritesduetothemoreefficientreservationofmodifiedblocks4.5.OtherDesignAspectsStudied4.5.1.WriteAllocationWriteallocation,alsoknownasfetchonmte,meansthatablockisallocatedinthecacheonawritemissaswellasonareadmiss.Whileitseemsnaturalforwrite-back,ittypicallyisnotusedwithwrite-through.Itisessentialforwrite-oncetoassurecoherency.Ourearlysimulationsshowedthatitwashighlydesirableforwrite-backandwrite-once,andsuperiorevenforwrite-throughwithsmallblocks.Thiswastrueusingboththemeasuresofmissratioandbustraffic.Inallresultspresented,writeallocationwasemployed.4.5.AssociaUvityWerananumberofsimulationsvaryingtheassocia-tivityallthewayfromdirectmappedtofullyassociative.Whilethisisclearlyanimportantparameter,wehavelit-tlenewtoreport,i.e.,fullyassociativecacheisthebest,but2-waysetassociativeisnotmuchworse,and4-waysetassociativeissomewhereinbetween.(ButseeSmith83).Wehadhopedtofindthatahighdegreeofassoeiativitywouldimproveperformance,becausesuchanorganizationismuchmorefeasibleintheVLSIdomain,butresultswerenegative.Forresultsreportedherewehaveassumeda4-way,setassociativecache.4.5.3.ReplacementAlgorithmReplacementstrategyhasbeenthesubjectofanotherstudyusingthesamesimulatorandtracesSmith83.Inordertolimititssignificance,whichseemstobeorthogonaltotheissuesraisedhere,wehaveassumedtrueLRUreplacementamongtheelementsofeachsetinallcases.4.5.4.BusWidthThewidthofthedatapathsbetweenunitsisanimportantparameterinthatitiscloselyrelatedtobandwidth.WehavethecapabilitytospecifythebuswidthbothfrombackingstoretocacheandfromcachetoCPU.Forthepurposesofthisstudy,wehaveassumedinallcasesthattheVAXmemorysupplies4bytestothecacheinonebuscycle,whilethePDP-11memorysup-plies2bytes.An8-bytetransferthereforeiscountedastwocyclesfortheVAXand4forthePDP-11.Weassumedthatthecachesuppliedoneword-16bitsforthePDP-11and32bitsfortheVAX-totheCPUoneachrequest.However,theintelligenceofthepro-cessordetermineshowoftenthesamewordmustbefetched.Thetracetapescontainallmemoryreferences.Wefilteredthesewiththeassumptionthatoninstructionfetchesthesamewordwouldnotbefetchedwithoutaninterveninginstructionfetch.Nofilteringwasdoneondatafetches,5.SummaryOursimulationsindicatethatasingleboardcom-puterwitha4K-bytecachecanperformreasonablywellwithlessthan10%oftheaccessesrequiredtoitspri-marymemorywithoutacache.ThePDP-11tracessug-gestanumberaslowas3%.WhiletheVAXnumbersarehigher,additionaldeclineswillbeexperiencedbyincreasingthesizeofthecachebeyond4Kbytes.Animportantresultistheuseofthewrite-oncealgorithmtoguaranteeconsistentdataamongmultipleprocessors.Wehaveshownthatthisalgorithmcanbeimplementedinawaythatdegradesperformanceonlytrivially(ignoringactualcollisions,whicharerare),andperformsbetterthaneitherpurewrite-backorwrite-throughinmanyinstances.Theuseofsmalltransferblocksizescanbecoupledwithlargeaddressblockstobuildaninexpensivecachewhichperformseffectivelyintheabsenceoffrequentprocessswitches.Thelowbusutilizationandthesolu-tiontothestaledataproblemmakepossibleanenviron-mentforwhichthisconditionismet.Eventhoughthemissratioincreases,bustrafficinitiallydeclinesastheaddressblockisenlarged,holdingthetransferblockconstant.Thereforelargeraddressblocksshouldbeusedforreservingmemoryformodificationevenifsmallblocksareusedfortransferofdata.Theapproachadvocatedhereisappropriateonlyforasystemcontainingasinglelogicalmemory.Thisis.significantbecauseitdependsontheserializationofmemoryaccessestoassureconsistency.Ithasapplica-tionsbeyondthosestudiedhere,however.Forexample,theaccesspathtomemorycouldbeviaaringnetwork,oranyothertechniqueinwhicheveryrequestpasseseveryprocessor.Thisextensionseemsparticularlyapplicableattheformaintainingconsistencyforafilesystemoracommonvirtualmemorybeingsuppliedto129multipleprocessorsthroughacommonbussuchasEth-ernet.Clearlytherearemanyenvironmentsforwhichthismodelisinappropriate-responsetoindividualtasksmaybeunpredictable,forexample.However,webelievethatsuchaconfigurationhasmanypotentialapplica-tionsandcanheexploitedeconomicallyiftheappropri-ateVLSIcomponentsaredesigned.Wehaveinvestigatedthedesignofsuchcomponentsandbelievethattheyarebothfeasibleandwell-suitedforVLSIRavishankar83.OuranalysisindicatesthatthecacheapproachisreasonableforasystemwherebandwidthbetweentheCPUandmostofitsmemoryisseverelylimited.Wehavedemonstratedthroughsimulationofrealprogramsthatacachememorycanbeusedtosignificantlyreducetheamountofcommunicationaprocessorrequires.Whitewewereinterestedinthisforasingle-chipmicrocom-puterofthefuture,wehavealsodemonstratedthatsuchanapproachisfeasibleforoneormorecurrentlypopu-larcommercialmarkets.6.AcknowledgementsThismaterialisbaseduponworksupportedbytheNationalScienceFoundationunderGrantMCS-8202952.WethankDr.A.J.SmithforprovidingthePDP-11tracetapesuponwhichmuchofourearlyworkdepended.WealsowishtothankT.-H.Yangfordevelop-ingtheVAXtracefacility.P.VitaleandT.Doylecontri-butedmuchthroughdiscussionsandbycommentingonanearlydraftofthemanuscript.7.ReferencesAmdahl82C.Amdahl,privatecommuvticttiovt,March82.Bell74J.Bell,D.Casasent,andC.G.Bell,Aninvestiga-tionofalternativecacheorganizations,IEEETrans.onComputers,Vol.C-23,No.4,April1974,pp.348-351.Bell78C.BeU,J.Judge,J.McNamara,Computerengineeriy:aDECviezuofttardaresystemdesign,DigitalPress,Bedford,Mass.,1978.Censier78L.M.CensierandP.Feautrier,Anewsolu
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 新余市2023-2024学年高三年级二模(第二次模拟考试)地理试卷(含答案及解析)
- 2024年04月上海市房地产科学研究院招考聘用笔试历年高频考点试题后附答案详解
- 楼地面工程质量控制要点
- 2024-2034年中国智慧交通行业发展监测及投资战略规划研究报告
- FH公司人才流失原因及其对策研究
- 大数据营销 课件全套 第1-12章 大数据大数据营销概述- 大数据营销案例
- 股权投资业务尽职调查工作指引
- 期中测试(1-5单元)(试题)-2023-2024学年人教版数学二年级下册+
- 福建省泉州市永春侨中片区2023-2024学年八年级下学期期中考试英语试题+
- 2024年四川工程职业技术学院单招综合素质考试题库完整参考答案
- 冀教版五年级下册英语课件Lesson 19
- 【阿莫西林胶囊生产工艺设计和工艺计算7200字】
- 安徽红太阳生物化学有限公司年产0.5万吨二联吡啶和1万吨敌草快项目(二期)环评报告
- 超前水平钻施工方案
- 化妆品生产工艺验证报告
- 广州版英语五年级下册知识点汇总
- 小学一年级数学下册计算题天天练
- 2022-2023学年浙江省9+1高中联盟高一(下)期中英语试卷
- 全国优质课大赛一等奖人教版高中地理必修二《服务业区位因素及其变化》精美赛课课件
- 《汉字形体演变课》
- 2022年中国人口与发展研究中心招聘应届生笔试备考题库及答案解析
评论
0/150
提交评论