From-Data-Fusion-to-Knowledge-Fusion.ppt

资源ID：90975746 资源大小：1.31MB 全文页数：29页
资源格式： PPT 下载积分：20积分

扫码快捷下载

会员登录下载

微信登录下载

三方登录下载：

微信扫一扫登录

手机扫码下载

请使用微信或支付宝扫码支付

• 扫码支付后即可登录、下载文档，同时代表您同意《人人文库网用户协议》

• 扫码过程中请勿刷新、关闭本页面，否则会导致文档资源下载失败

• 支付成功后，可再次使用当前微信或支付宝扫码免费下载本资源，无需再次付费

账号：
密码：
验证码：	换一换
当日自动登录忘记密码？

友情提示

1、下载资料失败解决办法

2、PDF文件下载后，可能会被浏览器默认打开，此种情况可以点击浏览器菜单，保存网页到桌面，就可以正常下载了。

3、本站不支持迅雷下载，请使用电脑自带的IE浏览器，或者360浏览器、谷歌浏览器下载即可。

4、本站资源（1积分=1元）下载后的文档和图纸-无水印,预览文档经过压缩，下载后原文更清晰。

5、试题试卷类文档，如果标题没有明确说明有答案则都视为没有答案，请知晓。

网站客服

侵权投诉

From-Data-Fusion-to-Knowledge-Fusion.ppt

,From Data Fusion to Knowledge Fusion,The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability.Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources.,Introduction,Extractor To build a knowledge base, we employ multiple knowledge extractors to extract base. This involves three key steps: Identifying which parts of the data indicate a data item and its value. Linking any entities that are mentioned to the corresponding entity identifier. Linking any relations that are mentioned to the corresponding knowledge base schema.,Some concepts,subject-predicate-object triples We can define the form (subject,predicate,object) as a linkage of some entities and relations. e.g.,(Tom Cruise, date_of_birth, 7/3/1962),Some concepts,We define the knowledge fusion problem, and to adapt existing data fusion techniques to solve this problem. We suggest some simple improvements to existing methods that substantially improve their quality. We make a detailed error analysis of our methods, and a list of suggested directions for future research to address some of the new problems raised by knowledge fusion.,Contribution,Our goal is to build a high-quality Web-scale knowledge base.The figure depicts the architecture of our system.,System architecture,Voting: Among conflicting values, each value has one vote from each data source, and we take the value with the highest vote count. Quality-based: Quality-based methods evaluate the trustworthiness of data sources and accordingly compute a higher vote count for a high-quality source. Relation-based: Relation-based methods extend quality-based methods by additionally considering the relationships between the sources.,Data fusion method,We follow the data format and ontology in Freebase, and store the knowledge as (subject, predicate, object) triples, because data in Freebase is structured but not triples. Note that in each triple the (subject, predicate) pair corresponds to a “data item” in data fusion, and the object can be considered as a “value” provided for the data item, like the (key,value) pair. So our goal is find (that means extract them by extractors) new facts abouts subject and predicate.,Knowledge base,Now we need to crawl a large set of Web pages and extract knowledge from four types of Web contents.There are some source types:TXT,DOM,TBL and ANO. Contributions from Web sources are highly skewed: the largest Web pages each contributes 50K triples while half of the Web pages each contributes a single triple.,Web sources,3 tasks in knowledge extraction triple identification: deciding which words or phrases describe a triple. entity linkage: deciding which Freebase entity a word or phrase refers to. predicate linkage: to decide which Freebase predicate is expressed in the given piece of text(beacuse predicates are implicit).,Extractors,Evaluating the quality of the extracted triples requires a gold standard that contains true triples and false triples and Freebase uses closed-world assumption to make it.However, this assumption is not always valid because of facts missing.Instead, we use local closed-world assupmtion(LCWA). Some of the erroneous triples are due to wrong information provided by Web sources whereas others are due to mistakes in extractions,but extractions are responsible for the majority of the errors(more than 96% errors are provided by extractors).,Quality of extracted knowledge,The more Web sources from which we extract a triple, or the more extractors that extract a triple, the more likely the triple is true. But there can be exceptions:,Quality of extracted knowledge,Given a set of extracted knowledge triples, each associated with provenance information such as the extractor and the Web source, knowledge fusion computes for each unique triple the probability that it is true. 3 challenges: The input of knowledge fusion is three-dimensional(the third is extractors). The output of knowledge fusion is a truthfulness probability for each triple. The scale of knowledge is typically huge.,Knowledge fusion,VOTE: For each data item, VOTE counts the sources for each value and trusts the value with the largest number of sources. ACCU: For each source S that provides a set of values VS, the accuracy of S is computed as the average probability for values in VS. For each data item D and the set of values VD provided for D, the probability of a value v VD is computed as its a posterior probability conditioned on the observed data using Bayesian analysis.,Adapting data fusion techniques,Three assumptions about ACCU： For each D there is a single true value. There are N uniformly distributed false values. The sources are independent of each other. By default we set N = 100 and accuracy A = 0.8. POPACCU: POPACCU(more robust than ACCU) extends ACCU by removing the assumption that wrong values are uniformly distributed; instead, it computes the distribution from real data and plugs it in to the Bayesian analysis.,Adapting data fusion techniques,Adaptations: We reduce the dimension of the KF input by considering each (Extractor, URL) pair as a data source, which we call a provenance. For ACCU and POPACCU, we simply take the probability computed by the Bayesian analysis. For VOTE, we assign a probability as follows: if a data item D = (s; p) has n provenances in total and a triple T = (s; p; o) has m provenances, the probability of T is p(T) = m/n. We scale up the three methods using a MapReduce-based framework.,Adapting data fusion techniques,Calibration curve: We plot the predicted probability versus the real probability and divide the triples into l + 1 buckets. Here we use l = 20 when we report our results. We compute the real probability for each bucket as the percentage of true triples in the bucket compared with our gold standard. We summarize the calibration using two measures. The deviation computes the average square loss between predicted probabilities and real probabilities. The weighted deviation is the same except that it weighs each bucket by the number of triples in the bucket.,Experimental evaluation,Experimental evaluation,The basic models consider an (Extractor, URL) pair as a provenance. Now maybe we can vary the granularity. The page-level and site-level. The predicate-level and all triples. The parttern-level and extractor-level.,Granularity of provenances,Granularity of provenances,We consider filtering provenances by two criteria the coverage and the accuracy. We compute triple probabilities for data items where at least one triple is extracted more than once, and then re-evaluate accuracy for each provenance. We ignore provenances for which we still use the default accuracy. We use a threshold on accuracy to ingore some provenances, while this method could cause a problem: we may lose all provenances so cannot predict the probability for any triple.,Provenance selection,Provenance selection,Leveraging the gold standard(by Freebase),Filter provenances by coverage. Change provenance granularity to (Extractor, pattern, site, predicate). Filter provenances by accuracy ( = .5). Initialize source accuracy by the gold standard.,Putting them all together,Improving the existing methods,Choose 20 false positives and 20 false negatives to check their provenance.,Error Analysis,Distinguishing mistakes from extractors and from sources Considering hierarchical value spaces Leveraging confidence of extractions Improving the closed world assumption,Future Directions,THANKS,Q & A,It is a kind of methodology: use a precise and machine processable method, to define the self-existent concept(object or entity). This methodology is similar to a fundamental modeling process and the output of this process is some models which we call ontologies.,Ontology,

注意事项

本文（From-Data-Fusion-to-Knowledge-Fusion.ppt）为本站会员（1320****408）主动上传，人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若此文所含内容侵犯了您的版权或隐私，请立即通知人人文库网（点击联系客服），我们立即给予删除！

温馨提示：如果因为网速或其他原因下载失败请重新下载，重复下载不扣分。