Data & Knowledge Engineering 49 (2004) 311-323
/locate/datak

Conceptual construction on incomplete survey data

Shouhong Wang a,*, Hai Wang b

a Department of Marketing/Business Information Systems, Charlton College of Business, University of Massachusetts Dartmouth, 285 Old Westport Road, North Dartmouth, MA 02747-2300, USA
b Department of Computer Science, University of Toronto, Toronto, ON, Canada M5S 3G4

Received 22 March 2003; received in revised form 9 September 2003; accepted 20 October 2003
Available online 26 November 2003

* Corresponding author. E-mail addresses: (S. Wang), (H. Wang).
0169-023X/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2003.10.007

Abstract

The raw survey data for data mining are often incomplete. The issues of missing data in knowledge discovery are often ignored in data mining. This article presents the conceptual foundations of data mining with incomplete survey data, and proposes query processing for knowledge discovery and a set of query functions for the conceptual construction in survey data mining. Through a case, this paper demonstrates that conceptual construction on incomplete data can be accomplished by using artificial intelligence tools such as self-organizing maps.

© 2003 Elsevier B.V. All rights reserved.

Keywords: Incomplete survey data; Survey data mining; Conceptual construction; Self-organizing maps; Cluster analysis; Knowledge discovery; Query processing

1. Introduction

Data mining is the process of trawling through data in the hope of identifying interpretable patterns. D[...] unsuspected relationships which are of interest or value to the database owners, or data miners [9]. Due to the high dimensionality and the huge volume of data, traditional statistical methods have their limitations in data mining. To meet the challenge of data mining, artificial intelligence based human-computer interactive techniques have been widely used in data mining [3,16].

There have been many non-statistical techniques for data mining. The self-organizing maps (SOM) method based on the Kohonen neural network [12] is one of the promising techniques. SOM-based cluster techniques have advantages over other methods for data mining. Data mining typically deals with very high-dimensional data. That is, an observation in the database for data mining is typically described by a large number of variables. The curse of dimensionality turns statistical correlations of data insignificant, and thus makes statistical methods powerless. The SOM method, however, does not rely on any assumptions of statistical tests, and is considered as an effective method in dealing with high-dimensional data [6,12]. More importantly, the SOM method provides a base for the visibility of clusters of high-dimensional data. This feature is not available in any other data analysis methods. It allows the data miner to analyze clusters based on the problem domain.

Surveys are one of the common data acquisition methods for data mining [4]. In data mining, one can rarely find a survey data set that contains complete entries of each observation for all of the variables. Surveys and questionnaires are often only partially completed by respondents. The extent of damage of missing data is unknown when it is virtually impossible to return the survey or questionnaires to the data source for completion, but is one of the most important parts of knowledge for data mining to discover. In fact, missing data is an important debatable issue in the knowledge engineering field [15].

In mining a survey database with incomplete data through cluster analysis, patterns of the missing data as well as the potential impacts of these missing data on the mining results are knowledge. For instance, a data miner often wishes to know how reliable a cluster analysis is; when and why certain types of values are often missing; what variables are correlated in terms of having missing values at the same time. These valuable pieces of knowledge can be discovered only after the missing part of the data set is fully explored.

This paper discusses the issue of missing data in mining survey databases for knowledge discovery, presents the conceptual foundations of conceptual construction, and proposes a set of query functions for conceptual construction in SOM-based data mining. The rest of the paper is organized as follows. Section 2 discusses the issues of missing data related to data mining. Section 3 introduces SOM for conceptual construction on incomplete data. Section 4 suggests four concepts as knowledge discovery in data mining with incomplete data. It provides a scheme of conceptual construction on incomplete data using SOM. Section 5 proposes a query tool that is used to manipulate SOM for conceptual construction. Section 6 presents a case study that applies the query tool to manipulate the SOM for the conceptual construction on a student opinion survey data set. Finally, Section 7 offers concluding remarks.

2. Issues of missing data

Incomplete data sets are ubiquitous in data mining. There have been many treatments of missing data. One of the convenient solutions to incomplete data is to eliminate from the data set those records that are missing values. This, however, ignores potentially useful information in those records. In cases where the proportion of missing data is large, the conclusions drawn from the screened data set are more likely biased or misleading.

Another simple approach of dealing with missing data is to use a generic "unknown" for all missing data items. In data mining, an unspecified "unknown" for all missing data items often causes confusion and misinterpretation.

The third solution to dealing with missing data is to estimate the missing value in the data field. In the case of time series data, interpolation based on two adjacent data points that are observed is possible. In general cases, one may use some expected value in the data field based on statistical measures [7]. However, in data mining, survey data are commonly of the types of ranking, category, multiple choices, and binary. Interpolation and use of an expected value for a particular missing data variable in these cases are generally inadequate. More importantly, research [2] indicates that a meaningful treatment of missing data shall always be independent of the problem being investigated.

More recently, there have been mathematical methods for finding the aggregate conceptual directions of a data set with missing data (e.g., [1,10]). These methods make themselves distinct from the traditional approaches of treating missing data by focusing on the collective effects of the missing data instead of individual missing values. This superior feature of these methods can be best built up for data mining on incomplete data. However, these statistical methods have limitations. First, it is assumed that missing values occur in a random fashion or follow a certain distribution function. Their strong assumptions about the distributions of data are often invalid, especially for cases of surveys with incomplete data. Second, these mathematical models are data-driven, instead of problem-domain-driven. In fact, a single generic conceptual construction algorithm is insufficient to handle a variety of goals of data mining since a goal of data mining is often related to its specific problem domain.

Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns of data [8]. Following this definition, this research emphasizes two aspects of concept construction in data mining with incomplete data. First, the criteria of validity, novelty, and usefulness of the concepts to be constructed in data mining with incomplete data could be problem-dependent. That is, the interest of a data pattern depends on the data miner and does not solely depend on the estimated statistical strength of the pattern [14]. Second, the conceptual construction based on the incomplete data is accomplished through heuristic search in combinatorial spaces built on computer and human cognitive theories [13].

Human-computer collaboration concept construction is the interactive process between the data miner and computer to extract novel, plausible, useful, relevant, and interesting knowledge associated with the missing data. In our view, data mining differs from traditional statistics in dealing with missing data in many ways.

(1) Data mining attempts to extract unsuspected and potentially useful patterns from the data for the data miners with novel goals related to the missing data, rather than to estimate the individual values of the missing data.
(2) Data mining is a human centered process implemented through knowledge discovery loops coupled with human-computer interaction to perceive the impact of the missing data at an aggregate level, rather than a one-way mathematical derivation based on unverified assumptions.

3. Tool for conceptual construction: self-organizing maps (SOM)

Given a large set of high-dimensional survey samples, there is usually a significant number of observations that have missing values; however, not all missing data are relevant to the data miner's interest. Hence, any simple brute-force search method for missing data is not only infeasible for a huge amount of data, but also helpless when the data miner is to identify problems, or develop concepts, through data mining. To identify problems or develop concepts, the data miner needs a tool to observe unsuspected patterns of the available data and the missing parts.

Self-organizing maps (SOM) [12] have been widely used for clustering, since SOM are more computationally efficient than the popular k-means clustering algorithm. More importantly, SOM provide data visualization for the data miner to view high-dimensional data [11]. Research [14,16] indicates that SOM are effective in data mining for the identification of unsuspected patterns in the data. Specifically, SOM can be used for cluster analysis on multivariate survey data. This study takes one step further and uses SOM as a tool for concept construction related to missing data. Conceptual construction on incomplete data is to investigate the patterns of the missing data as well as the potential impacts of these missing data on the mining results based only on the complete data. As seen later in our illustrative examples, SOM provide a mechanism for human-computer collaboration to construct concepts from the data with missing values. SOM can learn certain useful features fo
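
As a rough illustration of the kind of tool Section 3 describes, the following is a minimal sketch of a one-dimensional SOM in plain Python. It is not the authors' implementation. The function names `train_som` and `best_unit`, the linear decay schedule, the toy five-point ratings, and the choice to skip missing (`None`) answers when matching an incomplete record to a map unit are all assumptions made for illustration.

```python
import math
import random


def best_unit(weights, record):
    """Return the index of the map unit closest to `record`.

    Missing answers (None) are skipped, so an incomplete record is matched
    on its observed variables only (an illustrative assumption).
    """
    def sq_dist(unit):
        return sum((x - w) ** 2 for x, w in zip(record, unit) if x is not None)

    return min(range(len(weights)), key=lambda j: sq_dist(weights[j]))


def train_som(records, n_units=4, epochs=200, seed=0):
    """Train a tiny 1-D SOM on complete records (lists of numbers)."""
    rng = random.Random(seed)
    dim = len(records[0])
    # Start each unit at a randomly chosen training record.
    weights = [[float(v) for v in rng.choice(records)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                          # learning rate decays linearly
        radius = max(0.5, (n_units / 2.0) * (1.0 - frac))  # neighbourhood shrinks
        for record in rng.sample(records, len(records)):   # one shuffled pass
            winner = best_unit(weights, record)
            for j, unit in enumerate(weights):
                # Gaussian neighbourhood along the 1-D chain of units.
                pull = rate * math.exp(-((j - winner) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    unit[k] += pull * (record[k] - unit[k])
    return weights


# Hypothetical toy survey: five-point ratings on three questions.
complete = [[1, 1, 2], [1, 2, 1], [2, 1, 1], [5, 5, 4], [5, 4, 5], [4, 5, 5]]
incomplete = [[1, None, 2], [None, 5, 4]]

som = train_som(complete)  # the map is built from the complete records only
for record in incomplete:
    print(record, "-> unit", best_unit(som, record))
```

Training on the complete records and only afterwards placing the incomplete ones on the map keeps the two parts of the data set separate, so a data miner can compare where records with missing answers fall relative to the clusters formed by the complete data.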