Tài liệu Best probability queries on probabilistic databases

.PDF

144

136

sharebook Báo vi phạm

Tải xuống 136

Mô tả:

BEST PROBABILITY QUERIES ON PROBABILISTIC DATABASES Trieu Minh Nhut Le A thesis submitted in total fulﬁlment of the requirements for the degree of Doctor of Philosophy School of Engineering and Mathematical Sciences Faculty of Science, Technology and Engineering La Trobe University Bundoora, Victoria 3086 Australia January 2014 2 Acknowledgements I have been given invaluable support by several people during PhD candidature. I would like to take this opportunity to express my gratitude to them. First and foremost, this thesis could not have come into existence without the tremendous support and patient supervision from my supervisor. I would like to thank Dr Jinli Cao for her endless support. I sincerely appreciate her valuable advice and guidance to my research. Second, I would like to thank Dr Zhen He at La Trobe University for providing insightful ideas, guidance, and comments on my research. I have been fortunate to collaborate with him on various work and have learnt precious skills in research paper writing. He has treated me more like a friend than a student and has always oﬀered sound advice whenever I needed it. Third, I would like to thank my co-supervisor Prof. Wenny Rahayu for her role as a member of my Research Committee and for providing helpful feedback at every stage of my Ph.D. I thank Ms. Michele Mooney for her careful proof reading of my research papers and the ﬁnal draft of this thesis. Last but not least, I would like to express my gratitude and love to my family for always being there and when I needed them most, and for supporting me throughout my life. Especially, I would like to thank my mum, Tiet Thi Phan, for her continuous love, care, and support. I would like to thank my lover, Tuyen Mong Do, for her love, care and encouragement. i Abstract This thesis focuses on answering probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model. These are two of the most important queries for decision support systems. Almost all other existing methods for answering queries on probabilistic data require the user to set a probability threshold. However, it is diﬃcult to set a threshold because if it is set too high, important results may be lost, but if it is set too low, a lot of low quality results may be returned. In this thesis, novel approaches for answering probabilistic top-k and skyline queries are proposed using the dominance principle as natural and eﬀective methods to select results of queries with an acceptable number of answers, ensuring all important answers are captured without the need to set a threshold. There are three challenges to answering both probabilistic top-k and skyline queries. The ﬁrst challenge is to develop novel probabilistic top-k and skyline queries using the dominance principle to return only the most interesting results. The second challenge is to develop formulas based on probabilistic theory to directly calculate the probabilities of the results without considering any possible worlds and to also develop algorithms to eﬀectively prune the search space. The third challenge is to ensure that all the semantic properties of the probabilistic queries are covered. The evaluations of the performance of the proposed approaches show that, ﬁrstly, the results of the queries are not only very reasonable in size but also capture all the important answers. Secondly, the proposed algorithms outperform the current algorithms by accelerating the pruning search space, thereby reducing execution time. Lastly, all the semantic properties of probabilistic queries are covered. Statement of Authorship Except where reference is made in the text of this thesis, this thesis contains no material published elsewhere or extracted in whole or in part from a thesis submitted for the award of any other degree or diploma. No other person’s work has been used without due acknowledgment in the main text of the thesis. This thesis has not been submitted for the award of any degree or diploma in any other tertiary institution. Trieu Minh Nhut Le Date: iii 0. STATEMENT OF AUTHORSHIP iv External Refereed Publications Trieu Minh Nhut Le, Jinli Cao, Top-k best probability queries on probabilistic data, Proceedings of the 17th International Conference on Database Systems for Advanced Applications Volume Part II, DASFAA’12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 116. Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Top-k best probability queries and semantics ranking properties on probabilistic databases. Data & Knowledge Engineering, 2013. Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Answering skyline queries on probabilistic data using the dominance of probabilistic skyline tuples. The ACM Transactions on Database Journal, under reviewed, August 2013. v 0. EXTERNAL REFEREED PUBLICATIONS vi Contents Acknowledgements i Statement of Authorship iii External Refereed Publications v List of Figures xii 1 Introduction 1 1.1 Uncertain data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Querying probabilistic data . . . . . . . . . . . . . . . . . . . . . . . 2 1.2.1 Answering probabilistic top-k queries . . . . . . . . . . . . . . 2 1.2.2 Answering probabilistic skyline queries . . . . . . . . . . . . . 6 1.3 1.4 Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.1 Contribution on answering probabilistic top-k queries . . . . . 11 1.3.2 Contribution on answering probabilistic skyline queries . . . . 12 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Background 2.1 2.2 2.3 15 Probabilistic database models . . . . . . . . . . . . . . . . . . . . . . 15 2.1.1 The uncertain object model . . . . . . . . . . . . . . . . . . . 15 2.1.2 The possible worlds semantics model . . . . . . . . . . . . . . 16 Queries on databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.1 The top-k queries on data . . . . . . . . . . . . . . . . . . . . 19 2.2.2 Skyline queries on data . . . . . . . . . . . . . . . . . . . . . . 19 Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 vii CONTENTS 3 Existing work on probabilistic queries 3.1 3.2 3.3 25 Answering top-k queries on probabilistic data. . . . . . . . . . . . . . 25 3.1.1 The uncertain-top-k queries. . . . . . . . . . . . . . . . . . . . 26 3.1.2 The uncertain-k-rank . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3 The probabilistic threshold top-k query . . . . . . . . . . . . . 28 3.1.4 The global-top-k query. . . . . . . . . . . . . . . . . . . . . . . 29 3.1.5 The expected-score query . . . . . . . . . . . . . . . . . . . . 30 3.1.6 The expected-rank . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1.7 The robust rank . . . . . . . . . . . . . . . . . . . . . . . . . . 32 Evaluating probabilistic top-k queries on semantic properties. . . . . 33 3.2.1 Semantic properties for top-k probabilistic queries . . . . . . . 33 3.2.2 Analysing the answers to probabilistic top-k queries . . . . . . 34 Answering skyline queries on uncertain data . . . . . . . . . . . . . . 36 3.3.1 Answering skyline queries on incomplete data . . . . . . . . . 37 3.3.2 Answering probabilistic skyline queries using the uncertain object model . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.3 3.4 Computing all skyline probabilities for uncertain data . . . . . 39 Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4 Answering top-k best probability queries 4.1 4.2 4.3 4.4 41 Motivation and our proposal. . . . . . . . . . . . . . . . . . . . . . . 41 4.1.1 Problem deﬁnition . . . . . . . . . . . . . . . . . . . . . . . . 42 4.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1.3 Calculation of top-k probability . . . . . . . . . . . . . . . . . 46 4.1.4 Calculation of top-k probability with a generation rule . . . . 49 The top-k best probability query . . . . . . . . . . . . . . . . . . . . 52 4.2.1 Deﬁnition of the top-k best probability query . . . . . . . . . 52 4.2.2 Signiﬁcance of top-k best probability query . . . . . . . . . . . 54 4.2.3 Finding top-k best probability and pruning rules . . . . . . . . 55 4.2.4 The top-k best probability algorithm . . . . . . . . . . . . . . 59 Semantics of top-k best probability query and other top-k queries. . . 60 4.3.1 Semantics of ranking properties . . . . . . . . . . . . . . . . . 60 4.3.2 Top-k queries satisfying semantic properties . . . . . . . . . . 62 Experimental study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4.1 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.4.2 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 viii CONTENTS 4.5 Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5 Answering the best probabilistic skyline queries 5.1 5.2 5.3 5.4 5.5 73 Motivation and our proposal. . . . . . . . . . . . . . . . . . . . . . . 73 5.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.1.2 Our proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Answering the bestpro-skyline query. . . . . . . . . . . . . . . . . . . 77 5.2.1 Naı̈ve deﬁnition of the probabilistic skyline query . . . . . . . 77 5.2.2 The bestpro-skyline query . . . . . . . . . . . . . . . . . . . . 78 Overview of a Naı̈ve solution compared to our solution. . . . . . . . . 79 5.3.1 Naı̈ve solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3.2 Our solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Calculating skyline probability without enumerating all possible worlds 82 5.4.1 Formula for calculating the skyline probability of a tuple . . . 82 5.4.2 Handling generation rules . . . . . . . . . . . . . . . . . . . . 83 Bestpro-skyline Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 88 5.5.1 Pruning the search using the Guard value of pivot tuples . . . 88 5.5.2 Nearest Neighbour-based bestpro-skyline (NN-BPS) algorithm 92 5.5.3 The skyline result-based bestpro-skyline (SKY-BPS) algorithm 97 5.5.4 Analysis of pruning performance of NN-BPS algorithm against SKY-BPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.6 5.7 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.6.1 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.6.2 Algorithm setup . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.6.3 Measurement metric . . . . . . . . . . . . . . . . . . . . . . . 107 5.6.4 Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.7.1 The results of bestpro-skyline query on real data . . . . . . . . 108 5.7.2 Comparing Naı̈ve-Threshold and NN-BPS using real data . . . 110 5.7.3 The pruning eﬀectiveness of NN-BPS and SKY-BPS using the synthetic data set. . . . . . . . . . . . . . . . . . . . . . . . . 111 5.7.4 5.8 Measuring pruning eﬀectiveness as a function of time . . . . . 114 Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 ix CONTENTS 6 Conclusion 119 6.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.2 Conclusion and Key Findings . . . . . . . . . . . . . . . . . . . . . . 121 6.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Bibliography 130 x List of Figures 2.1 Probabilistic database . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Skyline tuples t1 and t3 of Table 1.4 . . . . . . . . . . . . . . . . . . . 21 2.3 Skyline process using the nearest neighbour algorithm . . . . . . . . . 21 3.1 A set of uncertain objects . . . . . . . . . . . . . . . . . . . . . . . . 38 4.1 The answer to the PT-10 vs. thresholds. . . . . . . . . . . . . . . . . 67 4.2 Accessed tuples vs. k . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3 Tuples in answer vs. k . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4 Accessed tuples vs. data sets . . . . . . . . . . . . . . . . . . . . . . . 69 4.5 Tuples in answer vs. data sets . . . . . . . . . . . . . . . . . . . . . . 69 5.1 Probabilistic data of Table 5.1 graphed in attribute space . . . . . . . 81 5.2 Nearest neighbour ordered pivot tuple selection . . . . . . . . . . . . 82 5.3 Skyline ordered pivot tuple selection . . . . . . . . . . . . . . . . . . 82 5.4 Two heuristic algorithms based on our approach of using the Guard value of pivot tuples to prune the search space. . . . . . . . . . . . . 82 5.5 Probabilistic data of Table 5.4 graphed in attribute space . . . . . . . 84 5.6 Replacement of generation rules with compound tuples . . . . . . . . 86 5.7 An example of a probabilistic data set with generation rules . . . . . 87 5.8 The dominating tuple sets DORtp and DORte of tp and te , respectively. . 92 5.9 The region scanned by NN-BPS when processing the DOtp set . . . . 96 5.10 The region scanned by SKY-BPS when processing the DOtp set . . . 96 5.11 Example of search order for NN-BPS versus SKY-BPS . . . . . . . . 102 5.12 Total execution time of varying data set size of real data . . . . . . . 110 5.13 Size of result set of varying data set size of real data . . . . . . . . . . 110 5.14 Maximum tuple probability of 0.5 of varying data set size of synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 xi LIST OF FIGURES 5.15 Maximum tuple probability of 1.0 of varying data set size of synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.16 Maximum tuple probability of 0.5 of varying data set size of synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.17 Maximum tuple probability of 1.0 of varying data set size of synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 5.18 Maximum tuple probability of 0.5 of varying number of dimensions . 115 5.19 Maximum tuple probability of 1.0 of varying number of dimensions . 115 5.20 Maximum tuple probability of 0.5 of varying number of dimensions . 115 5.21 Maximum tuple probability of 1.0 of varying number of dimensions . 115 5.22 Maximum tuple tion of time . . 5.23 Maximum tuple tion of time . . probability of 0.5 . . . . . . . . . . probability of 1.0 . . . . . . . . . . of pruning . . . . . . of pruning . . . . . . eﬀectiveness as a func. . . . . . . . . . . . . . 115 eﬀectiveness as a func. . . . . . . . . . . . . . 115 5.24 Results of varying the maximum tuple probability . . . . . . . . . . . 116 xii Chapter 1 Introduction Many business intelligence and decision making applications derive reliable results from querying uncertain data. A popular way of modeling uncertain data is to use the possible worlds semantics model. The purpose of this thesis is to propose two novel approaches to eﬃciently process and answer probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model. 1.1 Uncertain data Uncertainty in data occurs in many real world sources, such as sensor data, network data, survey data etc. Uncertain data can arise due to the limited accuracy of the equipment, noise, a lack of information, and errors or delays in data transmission. These data can be incomplete, and/or contain imprecise information. For example, to control moving objects, sensors are used to detect the direction, speed, and/or position of these objects. The data which is collected by this system is usually uncertain data, because there may be errors in the transmission of the data which can be caused by the high voltage environment, noise, or bad weather. Probability information can be used to indicate the conﬁdence in measurements taken from such devices as cameras, radars, satellites etc. In data integration, data from diﬀerent sources are combined to provide an uniﬁed view of data to users. The result of the data integration process is usually uncertain data due to missing functional annotations, non-uniform conceptualization, or an unstructured data repository of data mappings [16] [40] [21]. For example, integrating the life science database (biology, enterprises, government, libraries database) is diﬃcult due to the large number of diverse, interrelated data sources, which is usually represented using uncertain data. Therefore, processing uncertain 1 1. INTRODUCTION data is the largest obstacle faced by these applications [18] [16]. Data may be incomplete, may contain errors, or may be represented using probabilistic information. We call these types of data uncertain data. A number of important applications on decision making and market surveillance need to process uncertain data [4] [40] [34] [17] [42]. There has been much work on modeling uncertain data [45], [15], managing uncertain data [63], [4], ranking queries [24], [51], [59], [37], [4], [57], [64], nearest neighbour queries [10], [29], [11], skyline queries on uncertain data [6], [46] etc. It is challenging to obtain accurate, reliable, and meaningful results from uncertain data. 1.2 Querying probabilistic data The main requirement of all practical applications is that probabilistic queries on probabilistic data must be processed eﬃciently and return accurate and reliable results [24] [46] [11]. Top-k and skyline queries are important tools for data exploration on data mining, decision making, and market analysis applications [27] [53] [9] [46] [24] [35]. In a relational database system, probabilistic queries such as skyline or top-k queries select interesting and reliable results from the various alternatives within the probabilistic data. Therefore, this thesis focuses on answering probabilistic top-k queries and probabilistic skyline queries on probabilistic data using the possible worlds semantics model. 1.2.1 Answering probabilistic top-k queries Top-k queries can be used to help investors make business decisions such as choosing projects which have the top-k highest proﬁts. On probabilistic databases, top-k queries can be answered by listing all possible worlds [24] [23] [50] [51] [2] [57]. A possible world contains a number of tuples in the probabilistic data set. Each possible world can contain k tuples associated with a non-zero probability for existence. Diﬀerent possible worlds can contain diﬀerent sets of k highest tuples. Therefore, it is necessary to list all possible worlds to select the top-k answer. For example, in business, investors often make decisions about various products based on analysis, statistical data and mining data [42], which provide predictions relating to potentially successful and unsuccessful projects. To analyse the market, investors ﬁrstly collect the historical statistical data, and then use the data to predict future market trends with probabilistic predictions. This is known as prob- 2 1.2 Querying probabilistic data abilistic data. For example, assume that the data in Table 1.1 have been collected and analysed statistically, according to historical data resources [41]. Each tuple represents an investment project of USD $100 to produce a speciﬁc product (Product ID). Investing in products based on their probabilities (Probability) will result in an estimated amount of proﬁt. In tuple t1 , a businessman invests USD $100 on product A, and it has a 0.29 chance of obtaining a proﬁt of USD $25. Tuple t1 t2 t3 t4 t5 t6 Product ID A B E B C E Proﬁt of USD $100 investment Probability 25 0.29 18 0.3 17 0.8 13 0.4 12 1.0 11 0.2 Table 1.1: Predicted Proﬁts of USD $100 investment on Products In the real world, when analysing historical data, predictions on future market trends return two or more values per product with probabilities that the predictions are correct. Therefore, some tuples in Table 1.1 have the same product ID with diﬀerent proﬁts. In the probabilistic data model, these tuples are mutually exclusive, and controlled by a set of rules (generation rule) [24] [23] [25] [48]. For example, tuples t2 and t4 are projects that invest in product B with a 0.3 probability of producing a USD $18 proﬁt and 0.4 probability of producing a USD $13 proﬁt, respectively. In this case, if the prediction for tuple t2 is true, then the prediction for tuple t4 will not be true. It is impossible for both proﬁts to be true for the same product ID. They are mutually exclusive predictions. In Table 1.1, the probabilistic data are restricted by the exclusive rules R1 = t2 ⊕ t4 and R2 = t3 ⊕ t6 . Top-k queries can be used to help investors make business decisions such as choosing projects which have the top-2 highest proﬁts. On probabilistic databases, top-k queries can be answered by using the probability space that enumerates the list of all possible worlds [24] [23] [50] [51] [2] [57]. A possible world contains a number of tuples in the probabilistic data set. Each possible world has a non-zero probability for existence and can contain k tuples with the highest proﬁts. Diﬀerent possible worlds can contain diﬀerent sets of k tuple answers. Therefore, it is necessary to list all possible worlds of Table 1.1 to ﬁnd the top-2 answers for the top-2 query on the probabilistic database. Thus, Table 1.2 lists three dimensions: the possible world, the probability of existence, and the top-2 tuples in each possible world. 3 1. INTRODUCTION Possible world Probability of existence Top-2 tuples W1 = {t1 , t2 , t3 , t5 } 0.0696 t1 , t2 W2 = {t1 , t2 , t5 , t6 } 0.0174 t1 , t2 W3 = {t1 , t3 , t4 , t5 } 0.0928 t1 , t3 W4 = {t1 , t4 , t5 , t6 } 0.0232 t1 , t4 W5 = {t1 , t3 , t5 } 0.0696 t1 , t3 W6 = {t1 , t5 , t6 } 0.0174 t1 , t5 W7 = {t2 , t3 , t5 } 0.1704 t2 , t3 W8 = {t2 , t5 , t6 } 0.0426 t2 , t5 W9 = {t3 , t4 , t5 } 0.2272 t3 , t4 W10 = {t4 , t5 , t6 } 0.0568 t4 , t5 W11 = {t3 , t5 } 0.1704 t3 , t5 W12 = {t5 , t6 } 0.0426 t5 , t6 Table 1.2: List of all possible worlds and top-2 tuples of probabilistic data shown in Table 1.1 According to Table 1.2, any tuple has a probability of being in the top-2. Therefore, Table 1.3 lists the tuples, proﬁt, probability, and top-2 probability to analyse the top-2 answers of probabilistic databases. The top-2 probability of a tuple is aggregated by the sum of its probabilities of existence in the top-2 in Table 1.2. Tuple t1 t2 t3 t4 t5 t6 Proﬁt Probability Top-2 probability 25 0.29 0.29 18 0.3 0.3 17 0.8 0.7304 13 0.4 0.3072 12 1.0 0.3298 11 0.2 0.0426 Table 1.3: List of top-2 probabilities of tuples of probabilistic data shown in Table 1.1 Problems with existing approaches. In previous research [24] [23], the top-k answers are found using the probability threshold approach called PT-k. PT-k query returns a set of tuples with top-k probabilities greater than or equal to the users’ threshold value. For example, the answer to the PT-2 query with threshold 0.3 in the example listed in Table 1.3 is the set containing 4 tuples {t2 , t3 , t4 , t5 }. We have identiﬁed three drawbacks with PT-k query. These are listed below: • PT-k query may lose some important results. According to the PT-2 query, tuple t1 (25, 0.29) is eliminated by the PT-2 algorithm because its top-2 probability is less than threshold 0.3. In this case, we recommend that tuple t1 4 1.2 Querying probabilistic data should be in the result, the reason being that tuple t1 (25, 0.29) is not worse than tuple t4 (13, 0.3072), when comparing both attributes of proﬁt and top-2 probability. That is, t1 .proﬁt (25) is signiﬁcantly greater than t4 .proﬁt (13) and t1 .top-2 probability (0.29) is slightly less than t2 .top-2 probability (0.3072). In business, investors may like to choose project t1 because they can earn nearly double the proﬁt compared to project t4 , while project t1 is only slightly riskier than project t4 with a top-2 probability of 0.0172. Therefore, t1 should be included in the top-k answers. • PT-k answers may contain some redundant tuples which should be eliminated from the top-k results earlier in the query process. Referring to Table 1.3 again, tuples t4 and t5 should be eliminated immediately from the answer because the values of both attributes, proﬁt and top-2 probability in t4 (13, 0.3072) and t5 (12, 0.3298), are less than those in t3 (18, 0.7304). It is obvious that investors will choose project t3 which is more dominant than projects t4 and t5 in both proﬁt and top-k probability. • It is diﬃcult for users to choose a threshold for the PT-k method. The threshold is a crucial factor in enhancing the eﬃciency and eﬀectiveness of PT-k query [24] [23]. Users may not know much about probabilistic values and probabilistic databases. Therefore, it may be diﬃcult for users to ﬁnd the most suitable threshold initially. This, in turn, means they may need to waste time using trial and error to ﬁnd the most suitable threshold value. Our Approach. In this thesis, we deﬁne a new probabilistic top-k query called the “top-k best probability query” which overcomes the three drawbacks mentioned above. The top-k best probability query takes both top-k proﬁt and top-k probability of all possible worlds into account when selecting the best top-k answers. Below, we list the desirable properties of the top-k best probability queries, and how we achieve these properties: • Inclusion of important tuples and elimination of redundant tuples: users are usually interested in projects with the highest proﬁts. Therefore, the k-tuples with the highest proﬁts (scores) should be in the answer set of the top-k queries on probabilistic data. For example, in Table 1.1, tuples t1 and t2 have the top2 best proﬁts and therefore should be in the top-2 answer. In addition, we use the dominance principal to address both selecting the important tuples (nondominated tuples) and eliminating the redundant tuples (dominated tuples). 5 1. INTRODUCTION The tuples returned are the ones that are not dominated by any other tuples in terms of both score and probability. The non-dominated tuples will have “the best top-k probabilities”. For example, tuple t3 is a non-dominated tuple which is added into the result because it has a greater top-2 probability than all other tuples in the top-2-highest scores. Then, the rest of the tuples t4 , t5 , t6 in the data set are eliminated, because they are dominated by tuple t3 in both proﬁt and top-2 probability. • Removal of the unclear threshold : The new method which combines the two previous techniques will remove the need of the threshold for processing the top-k queries on probabilistic databases. Therefore, the set {t1 , t2 , t3 } is the answer to the top-2 query on Table 1.1. The top-2 best probability query returns not only the tuples with the best top-2 ranking scores but also the top-2 highest probabilities to the users. 1.2.2 Answering probabilistic skyline queries In many increasingly important ﬁelds such as decision-making, market analysis, and personalized services, querying large sets of uncertain data is becoming popular. This gives rise to two important challenges: ﬁrstly, how to ﬁnd the important data of interest from a multi-dimensional data set; and secondly, how to deal with the uncertainty in the data. The skyline query is one way of addressing the ﬁrst challenge. It works by only returning tuples of data that are not dominated by any other tuples. For example, when looking for the best hotels based on distance and price, a non-dominated tuple is a hotel which has the lowest price and/or shortest distance compared to all other hotels. This allows many tuples (hotels) which are less interesting (hotels which are further away and higher priced) to be pruned from the result set. In addition, uncertainty in data can occur in many real world sources, for example, network data, sensor data, survey data, etc. There has been much work on answering top-k queries [24] [51] [59] [37] [4] [57] [64], nearest neighbour queries [10] [29] [11], skyline queries on uncertain data [6] [46] etc. It is challenging to obtain accurate, reliable, and meaningful results from this data. The above discussion motivates the need for answering skyline queries on uncertain data. Currently, existing work on answering probabilistic skyline queries using the uncertain object model either require the user to set a threshold [46] or returns the skyline probabilities of every instance [6]. Neither approach is desirable 6

- Xem thêm -

Tài liệu liên quan

Tài liệu vừa đăng

Tài liệu xem nhiều nhất