版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
DataStreamMiningandQueryingSlidestakenfromanexcellentTutorialonDataStreamMiningandQueryingbyMinosGarofalakis,
JohannesGehrkeandRajeevRastogi
AndbyMino’slecturesslidesfromthere:/cs286sp07/ProcessingDataStreams:MotivationAgrowingnumberofapplicationsgeneratestreamsofdataPerformancemeasurementsinnetworkmonitoringandtrafficmanagementCalldetailrecordsintelecommunicationsTransactionsinretailchains,ATMoperationsinbanksLogrecordsgeneratedbyWebServersSensornetworkdataApplicationcharacteristicsMassivevolumesofdata(severalterabytes)RecordsarriveatarapidrateGoal:Minepatterns,processqueriesandcomputestatisticsondatastreamsinreal-timeDataStreams:ComputationModelAdatastreamisa(massive)sequenceofelements:
StreamprocessingrequirementsSinglepass:EachrecordisexaminedatmostonceBoundedstorage:LimitedMemory(M)forstoringsynopsisReal-time:Perrecordprocessingtime(tomaintainsynopsis)mustbelowStreamProcessingEngine(Approximate)AnswerSynopsisinMemoryDataStreamsDataStreamProcessingAlgorithmsGenerally,algorithmscomputeapproximateanswersDifficulttocomputeanswersaccuratelywithlimitedmemoryApproximateanswers-DeterministicboundsAlgorithmsonlycomputeanapproximateanswer,butboundsonerrorApproximateanswers-ProbabilisticboundsAlgorithmscomputeanapproximateanswerwithhighprobabilityWithprobabilityatleast,thecomputedansweriswithinafactoroftheactualanswerSingle-passalgorithmsforprocessingstreamsalsoapplicableto(massive)terabytedatabases!
Sampling:BasicsIdea:AsmallrandomsampleSofthedataoftenwell-representsallthedataForafastapproxanswer,apply“modified”querytoSExample:selectaggfromRwhereR.eisodd
(n=12)
Ifaggisavg,returnaverageofoddelementsinSIfaggiscount,returnaverageoverallelementseinSofnifeisodd0ifeisevenUnbiased:Forexpressionsinvolvingcount,sum,avg:theestimatorisunbiased,i.e.,theexpectedvalueoftheansweristheactualanswerDatastream:935271658491SampleS:9518answer:5answer:12*3/4=9ProbabilisticGuaranteesExample:Actualansweriswithin5±1withprob0.9UseTailInequalitiestogiveprobabilisticboundsonreturnedanswerMarkovInequalityChebyshev’sInequalityHoeffding’sInequalityChernoffBoundTailInequalitiesGeneralboundsontailprobabilityofarandomvariable(thatis,probabilitythatarandomvariabledeviatesfarfromitsexpectation)
BasicInequalities:LetXbearandomvariablewithexpectationandvarianceVar[X].ThenforanyProbabilitydistributionTailprobabilityMarkov:Chebyshev:TailInequalitiesforSumsPossibletoderivestrongerboundsontailprobabilitiesforthesumofindependentrandomvariablesHoeffding’sInequality:LetX1,...,Xmbeindependentrandomvariableswith0<=Xi<=r.Letandbetheexpectationof.Then,foranyApplicationtoavgqueries:missizeofsubsetofsampleSsatisfyingpredicate(3inexample)risrangeofelementvaluesinsample(8inexample)TailInequalitiesforSums(Contd.)PossibletoderiveevenstrongerboundsontailprobabilitiesforthesumofindependentBernoullitrialsChernoffBound:LetX1,...,XmbeindependentBernoullitrialssuchthatPr[Xi=1]=p(Pr[Xi=0]=1-p).Letandbetheexpectationof.Then,forany,Applicationtocountqueries:missizeofsampleS(4inexample)pisfractionofoddelementsinstream(2/3inexample)
Remark:ChernoffboundresultsintighterboundsforcountqueriescomparedtoHoeffding’sinequalityComputingStreamSampleReservoirSampling[Vit85]:
MaintainsasampleSofafixed-sizeMAddeachnewelementtoSwithprobabilityM/n,wherenisthecurrentnumberofstreamelementsIfaddanelement,evictarandomelementfromSInsteadofflippingacoinforeachelement,determinethenumberofelementstoskipbeforethenexttobeaddedtoSConcisesampling[GM98]:DuplicatesinsampleSstoredas<value,count>pairs(thus,potentiallyboostingactualsamplesize)AddeachnewelementtoSwithprobability1/T(simplyincrementcountifelementalreadyinS)IfsamplesizeexceedsMSelectnewthresholdT’>TEvicteachelement(decrementcount)fromSwithprobability1-T/T’AddsubsequentelementstoSwithprobability1/T’StreamingModel:SpecialCases
Time-SeriesModelOnlyj-thupdateupdatesA[j](i.e.,A[j]:=c[j])
Cash-RegisterModelc[j]isalways>=0(i.e.,increment-only)Typically,c[j]=1,soweseeamulti-setofitemsinonepass
TurnstileModelMostgeneralstreamingmodelc[j]canbe>0or<0(i.e.,incrementordecrement)
ProblemdifficultyvariesdependingonthemodelE.g.,MIN/MAXinTime-Seriesvs.Turnstile!Linear-Projection(akaAMS)SketchSynopsesGoal:Buildsmall-spacesummaryfordistributionvectorf(i)(i=1,...,N)seenasastreamofi-valuesBasicConstruct:
RandomizedLinearProjectionoff()=projectontoinner/dotproductoff-vectorSimpletocomputeoverthestream:Addwheneverthei-thvalueisseenGenerate‘sinsmall(logN)spaceusingpseudo-randomgeneratorsTunableprobabilisticguaranteesonapproximationerrorDelete-Proof:Justsubtracttodeleteani-thvalueoccurrenceComposable:Simplyaddindependently-builtprojectionsDatastream:3,1,2,4,2,3,5,...Datastream:3,1,2,4,2,3,5,...f(1)f(2)f(3)f(4)f(5)11122where=vectorofrandomvaluesfromanappropriatedistributionExample:Binary-JoinCOUNTQueryProblem:ComputeanswerforthequeryCOUNT(RAS)Example:Exactsolution:tooexpensive,requiresO(N)space!N=sizeof(domain(A))DatastreamR.A:41241412032134DatastreamS.A:31242412213421=10(2+2+0+6)BasicAMSSketchingTechnique[AMS96]KeyIntuition:Userandomizedlinearprojectionsoff()todefinerandomvariableXsuchthatXiseasilycomputedoverthestream(insmallspace)E[X]=COUNT(RAS)Var[X]issmall
BasicIdea:Defineafamilyof4-wiseindependent{-1,+1}randomvariables
Pr[=+1]=Pr[=-1]=1/2Expectedvalueofeach,E[]=0Variablesare4-wiseindependentExpectedvalueofproductof4distinct=0Variablescanbegeneratedusingpseudo-randomgeneratorusingonlyO(logN)space(forseeding)!Probabilisticerrorguarantees(e.g.,actualansweris10±1withprobability0.9)AMSSketchConstructionComputerandomvariables:andSimplyaddtoXR(XS)wheneverthei-thvalueisobservedintheR.A(S.A)streamDefineX=XRXStobeestimateofCOUNTqueryExample:DatastreamR.A:412414DatastreamS.A:3124241202134122134213Binary-JoinAMSSketchingAnalysisExpectedvalueofX=COUNT(RAS)
Using4-wiseindependence,possibletoshowthat
isself-joinsizeofR(second/L2moment)01BoostingAccuracyChebyshev’sInequality:
BoostaccuracytobyaveragingoverseveralindependentcopiesofX(reducesvariance)
ByChebyshev:xxxAverageyBoostingConfidenceBoostconfidencetobytakingmedianof2log(1/)independentcopiesofYEachY=BernoulliTrial
Pr[|median(Y)-COUNT|COUNT](ByChernoffBound)=Pr[#failuresin2log(1/)trials>=log(1/)]yyycopiesmedian2log(1/)“FAILURE”:SummaryofBinary-JoinAMSSketchingStep1:Computerandomvariables:andStep2:DefineX=XRXSSteps3&4:AverageindependentcopiesofX;Returnmedianofaverages
MainTheorem(AGMS99):SketchingapproximatesCOUNTtowithinarelativeerrorofwithprobabilityusingspace
Remember:O(logN)spacefor“seeding”theconstructionofeachX
xxxAverageyxxxAverageyxxxAverageycopiescopiesmedian2log(1/)ASpecialCase:Self-joinSizeEstimateCOUNT(RAR)(originalAMSpaper)Second(L2)momentofdatadistribution,Giniindexofheterogeneity,measureofskewinthedata
Inthiscase,COUNT=SJ(R),sowegetan(e,d)-estimateusingspaceonly
Best-caseforAMSstreamingjoin-sizeestimationDistinctValueEstimationProblem:Findthenumberofdistinctvaluesinastreamofvalueswithdomain[0,...,N-1]Zerothfrequencymoment,L0(Hamming)streamnormStatistics:numberofspeciesorclassesinapopulationImportantforqueryoptimizersNetworkmonitoring:distinctdestinationIPaddresses,source/destinationpairs,requestedURLs,etc.Example(N=64)Datastream:305301751037Numberofdistinctvalues:5Assumeahashfunctionh(x)thatmapsincomingvaluesxin[0,…,N-1]uniformlyacross[0,…,2^L-1],whereL=O(logN)Letlsb(y)denotethepositionoftheleast-significant1bitinthebinaryrepresentationofyAvaluexismappedtolsb(h(x))MaintainHashSketch=BITMAParrayofLbits,initializedto0Foreachincomingvaluex,setBITMAP[lsb(h(x))]=1Hash(akaFM)SketchesforDistinctValueEstimation[FM85]x=5h(x)=101100lsb(h(x))=2000001BITMAP543210Hash(akaFM)SketchesforDistinctValueEstimation[FM85]Byuniformitythroughh(x):Prob[BITMAP[k]=1]=Prob[]=Assumingddistinctvalues:expectd/2tomaptoBITMAP[0],d/4tomaptoBITMAP[1],...LetR=positionofrightmostzeroinBITMAPUseasindicatoroflog(d)[FM85]provethatE[R]=,whereEstimated=Averageseveraliidinstances(differenthashfunctions)toreduceestimatorvariancefringeof0/1saroundlog(d)000001BITMAP000111111111position<<log(d)position>>log(d)0L-1[FM85]assume“ideal”hashfunctionsh(x)(N-wiseindependence) [AMS96]:pairwiseindependenceissufficienth(x)=,wherea,barerandombinaryvectorsin[0,…,2^L-1]Small-spaceestimatesfordistinctvaluesproposedbasedonFMideasDelete-Proof:Justusecountersinsteadofbitsinthesketchlocations+1forinserts,-1fordeletesComposable:Compon
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 缝纫洗烫设备项目立项申请报告
- 农用喷药机投资项目可行性分析报告
- 蜡缸恒温箱投资建设项目建议书
- 中考语文复习常见作文题型解析
- 认识古诗词及解词及韵律
- 中考复习中考语文复习复习的重点、思路和策略分析
- 2024年压实机械项目资金需求报告代可行性研究报告
- 中考复习语文课堂互动练习
- 工程物资回收合同
- 中考语文复习知识点注重整合
- 《政务处分法》VS《纪律处分条例》讲稿
- 2023学年完整公开课版同音字
- 企业感恩心态培训
- access密码破解方法
- 小儿麻醉特点及右美托咪定滴鼻
- 全过程工程咨询设计管理制度
- 人工智能在统计中的应用
- 文物勘察设计取费标准
- 幼儿园优质公开课:大班律动《超级玛丽》课件
- 5第五章历史时期的沙漠变迁
- 第四单元整合教学-学习演讲词-八年级语文下册课件
评论
0/150
提交评论