BEST PROBABILITY QUERIES ON
PROBABILISTIC DATABASES
Trieu Minh Nhut Le
A thesis submitted in total fulfilment
of the requirements for the degree of
Doctor of Philosophy
School of Engineering and Mathematical Sciences
Faculty of Science, Technology and Engineering
La Trobe University
Bundoora, Victoria 3086
Australia
January 2014
2
Acknowledgements
I have been given invaluable support by several people during PhD candidature. I
would like to take this opportunity to express my gratitude to them.
First and foremost, this thesis could not have come into existence without the
tremendous support and patient supervision from my supervisor. I would like to
thank Dr Jinli Cao for her endless support. I sincerely appreciate her valuable
advice and guidance to my research.
Second, I would like to thank Dr Zhen He at La Trobe University for providing
insightful ideas, guidance, and comments on my research. I have been fortunate
to collaborate with him on various work and have learnt precious skills in research
paper writing. He has treated me more like a friend than a student and has always
offered sound advice whenever I needed it.
Third, I would like to thank my co-supervisor Prof. Wenny Rahayu for her role
as a member of my Research Committee and for providing helpful feedback at every
stage of my Ph.D. I thank Ms. Michele Mooney for her careful proof reading of my
research papers and the final draft of this thesis.
Last but not least, I would like to express my gratitude and love to my family for
always being there and when I needed them most, and for supporting me throughout
my life. Especially, I would like to thank my mum, Tiet Thi Phan, for her continuous
love, care, and support. I would like to thank my lover, Tuyen Mong Do, for her
love, care and encouragement.
i
Abstract
This thesis focuses on answering probabilistic top-k and skyline queries on probabilistic data using the possible worlds semantics model. These are two of the most
important queries for decision support systems. Almost all other existing methods for answering queries on probabilistic data require the user to set a probability
threshold. However, it is difficult to set a threshold because if it is set too high,
important results may be lost, but if it is set too low, a lot of low quality results
may be returned.
In this thesis, novel approaches for answering probabilistic top-k and skyline
queries are proposed using the dominance principle as natural and effective methods to select results of queries with an acceptable number of answers, ensuring all
important answers are captured without the need to set a threshold.
There are three challenges to answering both probabilistic top-k and skyline
queries. The first challenge is to develop novel probabilistic top-k and skyline queries
using the dominance principle to return only the most interesting results. The second
challenge is to develop formulas based on probabilistic theory to directly calculate
the probabilities of the results without considering any possible worlds and to also
develop algorithms to effectively prune the search space. The third challenge is to
ensure that all the semantic properties of the probabilistic queries are covered.
The evaluations of the performance of the proposed approaches show that, firstly,
the results of the queries are not only very reasonable in size but also capture all
the important answers. Secondly, the proposed algorithms outperform the current
algorithms by accelerating the pruning search space, thereby reducing execution
time. Lastly, all the semantic properties of probabilistic queries are covered.
Statement of Authorship
Except where reference is made in the text of this thesis, this thesis contains no
material published elsewhere or extracted in whole or in part from a thesis submitted
for the award of any other degree or diploma.
No other person’s work has been used without due acknowledgment in the main
text of the thesis.
This thesis has not been submitted for the award of any degree or diploma in
any other tertiary institution.
Trieu Minh Nhut Le
Date:
iii
0. STATEMENT OF AUTHORSHIP
iv
External Refereed Publications
Trieu Minh Nhut Le, Jinli Cao, Top-k best probability queries on probabilistic data, Proceedings of the 17th International Conference on Database Systems
for Advanced Applications Volume Part II, DASFAA’12, Springer-Verlag, Berlin,
Heidelberg, 2012, pp. 116.
Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Top-k best probability queries
and semantics ranking properties on probabilistic databases. Data & Knowledge
Engineering, 2013.
Trieu Minh Nhut Le, Jinli Cao, and Zhen He, Answering skyline queries on
probabilistic data using the dominance of probabilistic skyline tuples. The ACM
Transactions on Database Journal, under reviewed, August 2013.
v
0. EXTERNAL REFEREED PUBLICATIONS
vi
Contents
Acknowledgements
i
Statement of Authorship
iii
External Refereed Publications
v
List of Figures
xii
1 Introduction
1
1.1
Uncertain data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
1.2
Querying probabilistic data . . . . . . . . . . . . . . . . . . . . . . .
2
1.2.1
Answering probabilistic top-k queries . . . . . . . . . . . . . .
2
1.2.2
Answering probabilistic skyline queries . . . . . . . . . . . . .
6
1.3
1.4
Contributions of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1
Contribution on answering probabilistic top-k queries . . . . . 11
1.3.2
Contribution on answering probabilistic skyline queries . . . . 12
Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Background
2.1
2.2
2.3
15
Probabilistic database models . . . . . . . . . . . . . . . . . . . . . . 15
2.1.1
The uncertain object model . . . . . . . . . . . . . . . . . . . 15
2.1.2
The possible worlds semantics model . . . . . . . . . . . . . . 16
Queries on databases . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1
The top-k queries on data . . . . . . . . . . . . . . . . . . . . 19
2.2.2
Skyline queries on data . . . . . . . . . . . . . . . . . . . . . . 19
Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
vii
CONTENTS
3 Existing work on probabilistic queries
3.1
3.2
3.3
25
Answering top-k queries on probabilistic data. . . . . . . . . . . . . . 25
3.1.1
The uncertain-top-k queries. . . . . . . . . . . . . . . . . . . . 26
3.1.2
The uncertain-k-rank . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.3
The probabilistic threshold top-k query . . . . . . . . . . . . . 28
3.1.4
The global-top-k query. . . . . . . . . . . . . . . . . . . . . . . 29
3.1.5
The expected-score query . . . . . . . . . . . . . . . . . . . . 30
3.1.6
The expected-rank . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.7
The robust rank . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Evaluating probabilistic top-k queries on semantic properties. . . . . 33
3.2.1
Semantic properties for top-k probabilistic queries . . . . . . . 33
3.2.2
Analysing the answers to probabilistic top-k queries . . . . . . 34
Answering skyline queries on uncertain data . . . . . . . . . . . . . . 36
3.3.1
Answering skyline queries on incomplete data . . . . . . . . . 37
3.3.2
Answering probabilistic skyline queries using the uncertain
object model . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3
3.4
Computing all skyline probabilities for uncertain data . . . . . 39
Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Answering top-k best probability queries
4.1
4.2
4.3
4.4
41
Motivation and our proposal. . . . . . . . . . . . . . . . . . . . . . . 41
4.1.1
Problem definition . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.2
Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.3
Calculation of top-k probability . . . . . . . . . . . . . . . . . 46
4.1.4
Calculation of top-k probability with a generation rule . . . . 49
The top-k best probability query . . . . . . . . . . . . . . . . . . . . 52
4.2.1
Definition of the top-k best probability query . . . . . . . . . 52
4.2.2
Significance of top-k best probability query . . . . . . . . . . . 54
4.2.3
Finding top-k best probability and pruning rules . . . . . . . . 55
4.2.4
The top-k best probability algorithm . . . . . . . . . . . . . . 59
Semantics of top-k best probability query and other top-k queries. . . 60
4.3.1
Semantics of ranking properties . . . . . . . . . . . . . . . . . 60
4.3.2
Top-k queries satisfying semantic properties . . . . . . . . . . 62
Experimental study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.1
Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.4.2
Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
viii
CONTENTS
4.5
Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Answering the best probabilistic skyline queries
5.1
5.2
5.3
5.4
5.5
73
Motivation and our proposal. . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.1.2
Our proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Answering the bestpro-skyline query. . . . . . . . . . . . . . . . . . . 77
5.2.1
Naı̈ve definition of the probabilistic skyline query . . . . . . . 77
5.2.2
The bestpro-skyline query . . . . . . . . . . . . . . . . . . . . 78
Overview of a Naı̈ve solution compared to our solution. . . . . . . . . 79
5.3.1
Naı̈ve solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.2
Our solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Calculating skyline probability without enumerating all possible worlds 82
5.4.1
Formula for calculating the skyline probability of a tuple . . . 82
5.4.2
Handling generation rules . . . . . . . . . . . . . . . . . . . . 83
Bestpro-skyline Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5.1
Pruning the search using the Guard value of pivot tuples . . . 88
5.5.2
Nearest Neighbour-based bestpro-skyline (NN-BPS) algorithm 92
5.5.3
The skyline result-based bestpro-skyline (SKY-BPS) algorithm 97
5.5.4
Analysis of pruning performance of NN-BPS algorithm against
SKY-BPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.6
5.7
Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6.1
Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6.2
Algorithm setup . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.3
Measurement metric . . . . . . . . . . . . . . . . . . . . . . . 107
5.6.4
Hardware setup . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Experiment results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.7.1
The results of bestpro-skyline query on real data . . . . . . . . 108
5.7.2
Comparing Naı̈ve-Threshold and NN-BPS using real data . . . 110
5.7.3
The pruning effectiveness of NN-BPS and SKY-BPS using the
synthetic data set. . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7.4
5.8
Measuring pruning effectiveness as a function of time . . . . . 114
Summary of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
ix
CONTENTS
6 Conclusion
119
6.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.2 Conclusion and Key Findings . . . . . . . . . . . . . . . . . . . . . . 121
6.3
Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Bibliography
130
x
List of Figures
2.1
Probabilistic database . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2
Skyline tuples t1 and t3 of Table 1.4 . . . . . . . . . . . . . . . . . . . 21
2.3
Skyline process using the nearest neighbour algorithm . . . . . . . . . 21
3.1
A set of uncertain objects . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1
The answer to the PT-10 vs. thresholds. . . . . . . . . . . . . . . . . 67
4.2
Accessed tuples vs. k . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3
Tuples in answer vs. k . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4
Accessed tuples vs. data sets . . . . . . . . . . . . . . . . . . . . . . . 69
4.5
Tuples in answer vs. data sets . . . . . . . . . . . . . . . . . . . . . . 69
5.1
Probabilistic data of Table 5.1 graphed in attribute space . . . . . . . 81
5.2
Nearest neighbour ordered pivot tuple selection . . . . . . . . . . . . 82
5.3
Skyline ordered pivot tuple selection . . . . . . . . . . . . . . . . . . 82
5.4
Two heuristic algorithms based on our approach of using the Guard
value of pivot tuples to prune the search space.
. . . . . . . . . . . . 82
5.5
Probabilistic data of Table 5.4 graphed in attribute space . . . . . . . 84
5.6
Replacement of generation rules with compound tuples . . . . . . . . 86
5.7
An example of a probabilistic data set with generation rules . . . . . 87
5.8
The dominating tuple sets DORtp and DORte of tp and te , respectively. . 92
5.9
The region scanned by NN-BPS when processing the DOtp set . . . . 96
5.10 The region scanned by SKY-BPS when processing the DOtp set . . . 96
5.11 Example of search order for NN-BPS versus SKY-BPS . . . . . . . . 102
5.12 Total execution time of varying data set size of real data . . . . . . . 110
5.13 Size of result set of varying data set size of real data . . . . . . . . . . 110
5.14 Maximum tuple probability of 0.5 of varying data set size of synthetic
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
xi
LIST OF FIGURES
5.15 Maximum tuple probability of 1.0 of varying data set size of synthetic
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.16 Maximum tuple probability of 0.5 of varying data set size of synthetic
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.17 Maximum tuple probability of 1.0 of varying data set size of synthetic
data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.18 Maximum tuple probability of 0.5 of varying number of dimensions . 115
5.19 Maximum tuple probability of 1.0 of varying number of dimensions . 115
5.20 Maximum tuple probability of 0.5 of varying number of dimensions . 115
5.21 Maximum tuple probability of 1.0 of varying number of dimensions . 115
5.22 Maximum tuple
tion of time . .
5.23 Maximum tuple
tion of time . .
probability of 0.5
. . . . . . . . . .
probability of 1.0
. . . . . . . . . .
of pruning
. . . . . .
of pruning
. . . . . .
effectiveness as a func. . . . . . . . . . . . . . 115
effectiveness as a func. . . . . . . . . . . . . . 115
5.24 Results of varying the maximum tuple probability . . . . . . . . . . . 116
xii
Chapter 1
Introduction
Many business intelligence and decision making applications derive reliable results
from querying uncertain data. A popular way of modeling uncertain data is to use
the possible worlds semantics model. The purpose of this thesis is to propose two
novel approaches to efficiently process and answer probabilistic top-k and skyline
queries on probabilistic data using the possible worlds semantics model.
1.1
Uncertain data
Uncertainty in data occurs in many real world sources, such as sensor data, network
data, survey data etc. Uncertain data can arise due to the limited accuracy of the
equipment, noise, a lack of information, and errors or delays in data transmission.
These data can be incomplete, and/or contain imprecise information. For example,
to control moving objects, sensors are used to detect the direction, speed, and/or
position of these objects. The data which is collected by this system is usually
uncertain data, because there may be errors in the transmission of the data which
can be caused by the high voltage environment, noise, or bad weather. Probability
information can be used to indicate the confidence in measurements taken from such
devices as cameras, radars, satellites etc.
In data integration, data from different sources are combined to provide an unified view of data to users. The result of the data integration process is usually
uncertain data due to missing functional annotations, non-uniform conceptualization, or an unstructured data repository of data mappings [16] [40] [21]. For example, integrating the life science database (biology, enterprises, government, libraries
database) is difficult due to the large number of diverse, interrelated data sources,
which is usually represented using uncertain data. Therefore, processing uncertain
1
1. INTRODUCTION
data is the largest obstacle faced by these applications [18] [16].
Data may be incomplete, may contain errors, or may be represented using probabilistic information. We call these types of data uncertain data. A number of
important applications on decision making and market surveillance need to process
uncertain data [4] [40] [34] [17] [42]. There has been much work on modeling uncertain data [45], [15], managing uncertain data [63], [4], ranking queries [24], [51],
[59], [37], [4], [57], [64], nearest neighbour queries [10], [29], [11], skyline queries
on uncertain data [6], [46] etc. It is challenging to obtain accurate, reliable, and
meaningful results from uncertain data.
1.2
Querying probabilistic data
The main requirement of all practical applications is that probabilistic queries on
probabilistic data must be processed efficiently and return accurate and reliable
results [24] [46] [11]. Top-k and skyline queries are important tools for data exploration on data mining, decision making, and market analysis applications [27] [53]
[9] [46] [24] [35]. In a relational database system, probabilistic queries such as skyline
or top-k queries select interesting and reliable results from the various alternatives
within the probabilistic data. Therefore, this thesis focuses on answering probabilistic top-k queries and probabilistic skyline queries on probabilistic data using the
possible worlds semantics model.
1.2.1
Answering probabilistic top-k queries
Top-k queries can be used to help investors make business decisions such as choosing projects which have the top-k highest profits. On probabilistic databases, top-k
queries can be answered by listing all possible worlds [24] [23] [50] [51] [2] [57]. A
possible world contains a number of tuples in the probabilistic data set. Each possible world can contain k tuples associated with a non-zero probability for existence.
Different possible worlds can contain different sets of k highest tuples. Therefore, it
is necessary to list all possible worlds to select the top-k answer.
For example, in business, investors often make decisions about various products
based on analysis, statistical data and mining data [42], which provide predictions
relating to potentially successful and unsuccessful projects. To analyse the market, investors firstly collect the historical statistical data, and then use the data to
predict future market trends with probabilistic predictions. This is known as prob-
2
1.2 Querying probabilistic data
abilistic data. For example, assume that the data in Table 1.1 have been collected
and analysed statistically, according to historical data resources [41]. Each tuple
represents an investment project of USD $100 to produce a specific product (Product ID). Investing in products based on their probabilities (Probability) will result
in an estimated amount of profit. In tuple t1 , a businessman invests USD $100 on
product A, and it has a 0.29 chance of obtaining a profit of USD $25.
Tuple
t1
t2
t3
t4
t5
t6
Product ID
A
B
E
B
C
E
Profit of USD $100 investment Probability
25
0.29
18
0.3
17
0.8
13
0.4
12
1.0
11
0.2
Table 1.1: Predicted Profits of USD $100 investment on Products
In the real world, when analysing historical data, predictions on future market
trends return two or more values per product with probabilities that the predictions
are correct. Therefore, some tuples in Table 1.1 have the same product ID with
different profits. In the probabilistic data model, these tuples are mutually exclusive,
and controlled by a set of rules (generation rule) [24] [23] [25] [48]. For example,
tuples t2 and t4 are projects that invest in product B with a 0.3 probability of
producing a USD $18 profit and 0.4 probability of producing a USD $13 profit,
respectively. In this case, if the prediction for tuple t2 is true, then the prediction
for tuple t4 will not be true. It is impossible for both profits to be true for the same
product ID. They are mutually exclusive predictions. In Table 1.1, the probabilistic
data are restricted by the exclusive rules R1 = t2 ⊕ t4 and R2 = t3 ⊕ t6 .
Top-k queries can be used to help investors make business decisions such as
choosing projects which have the top-2 highest profits. On probabilistic databases,
top-k queries can be answered by using the probability space that enumerates the list
of all possible worlds [24] [23] [50] [51] [2] [57]. A possible world contains a number of
tuples in the probabilistic data set. Each possible world has a non-zero probability
for existence and can contain k tuples with the highest profits. Different possible
worlds can contain different sets of k tuple answers. Therefore, it is necessary to list
all possible worlds of Table 1.1 to find the top-2 answers for the top-2 query on the
probabilistic database. Thus, Table 1.2 lists three dimensions: the possible world,
the probability of existence, and the top-2 tuples in each possible world.
3
1. INTRODUCTION
Possible world
Probability of existence Top-2 tuples
W1 = {t1 , t2 , t3 , t5 }
0.0696
t1 , t2
W2 = {t1 , t2 , t5 , t6 }
0.0174
t1 , t2
W3 = {t1 , t3 , t4 , t5 }
0.0928
t1 , t3
W4 = {t1 , t4 , t5 , t6 }
0.0232
t1 , t4
W5 = {t1 , t3 , t5 }
0.0696
t1 , t3
W6 = {t1 , t5 , t6 }
0.0174
t1 , t5
W7 = {t2 , t3 , t5 }
0.1704
t2 , t3
W8 = {t2 , t5 , t6 }
0.0426
t2 , t5
W9 = {t3 , t4 , t5 }
0.2272
t3 , t4
W10 = {t4 , t5 , t6 }
0.0568
t4 , t5
W11 = {t3 , t5 }
0.1704
t3 , t5
W12 = {t5 , t6 }
0.0426
t5 , t6
Table 1.2: List of all possible worlds and top-2 tuples of probabilistic data shown in
Table 1.1
According to Table 1.2, any tuple has a probability of being in the top-2. Therefore, Table 1.3 lists the tuples, profit, probability, and top-2 probability to analyse
the top-2 answers of probabilistic databases. The top-2 probability of a tuple is
aggregated by the sum of its probabilities of existence in the top-2 in Table 1.2.
Tuple
t1
t2
t3
t4
t5
t6
Profit Probability Top-2 probability
25
0.29
0.29
18
0.3
0.3
17
0.8
0.7304
13
0.4
0.3072
12
1.0
0.3298
11
0.2
0.0426
Table 1.3: List of top-2 probabilities of tuples of probabilistic data shown in Table
1.1
Problems with existing approaches. In previous research [24] [23], the top-k
answers are found using the probability threshold approach called PT-k. PT-k query
returns a set of tuples with top-k probabilities greater than or equal to the users’
threshold value. For example, the answer to the PT-2 query with threshold 0.3 in
the example listed in Table 1.3 is the set containing 4 tuples {t2 , t3 , t4 , t5 }. We have
identified three drawbacks with PT-k query. These are listed below:
• PT-k query may lose some important results. According to the PT-2 query,
tuple t1 (25, 0.29) is eliminated by the PT-2 algorithm because its top-2 probability is less than threshold 0.3. In this case, we recommend that tuple t1
4
1.2 Querying probabilistic data
should be in the result, the reason being that tuple t1 (25, 0.29) is not worse
than tuple t4 (13, 0.3072), when comparing both attributes of profit and top-2
probability. That is, t1 .profit (25) is significantly greater than t4 .profit (13) and
t1 .top-2 probability (0.29) is slightly less than t2 .top-2 probability (0.3072). In
business, investors may like to choose project t1 because they can earn nearly
double the profit compared to project t4 , while project t1 is only slightly riskier
than project t4 with a top-2 probability of 0.0172. Therefore, t1 should be included in the top-k answers.
• PT-k answers may contain some redundant tuples which should be eliminated
from the top-k results earlier in the query process. Referring to Table 1.3
again, tuples t4 and t5 should be eliminated immediately from the answer
because the values of both attributes, profit and top-2 probability in t4 (13,
0.3072) and t5 (12, 0.3298), are less than those in t3 (18, 0.7304). It is obvious
that investors will choose project t3 which is more dominant than projects t4
and t5 in both profit and top-k probability.
• It is difficult for users to choose a threshold for the PT-k method. The threshold is a crucial factor in enhancing the efficiency and effectiveness of PT-k
query [24] [23]. Users may not know much about probabilistic values and
probabilistic databases. Therefore, it may be difficult for users to find the
most suitable threshold initially. This, in turn, means they may need to waste
time using trial and error to find the most suitable threshold value.
Our Approach. In this thesis, we define a new probabilistic top-k query called
the “top-k best probability query” which overcomes the three drawbacks mentioned
above. The top-k best probability query takes both top-k profit and top-k probability
of all possible worlds into account when selecting the best top-k answers. Below,
we list the desirable properties of the top-k best probability queries, and how we
achieve these properties:
• Inclusion of important tuples and elimination of redundant tuples: users are
usually interested in projects with the highest profits. Therefore, the k-tuples
with the highest profits (scores) should be in the answer set of the top-k queries
on probabilistic data. For example, in Table 1.1, tuples t1 and t2 have the top2 best profits and therefore should be in the top-2 answer. In addition, we use
the dominance principal to address both selecting the important tuples (nondominated tuples) and eliminating the redundant tuples (dominated tuples).
5
1. INTRODUCTION
The tuples returned are the ones that are not dominated by any other tuples
in terms of both score and probability. The non-dominated tuples will have
“the best top-k probabilities”. For example, tuple t3 is a non-dominated tuple
which is added into the result because it has a greater top-2 probability than
all other tuples in the top-2-highest scores. Then, the rest of the tuples t4 , t5 , t6
in the data set are eliminated, because they are dominated by tuple t3 in both
profit and top-2 probability.
• Removal of the unclear threshold : The new method which combines the two
previous techniques will remove the need of the threshold for processing the
top-k queries on probabilistic databases.
Therefore, the set {t1 , t2 , t3 } is the answer to the top-2 query on Table 1.1. The
top-2 best probability query returns not only the tuples with the best top-2 ranking
scores but also the top-2 highest probabilities to the users.
1.2.2
Answering probabilistic skyline queries
In many increasingly important fields such as decision-making, market analysis, and
personalized services, querying large sets of uncertain data is becoming popular.
This gives rise to two important challenges: firstly, how to find the important data
of interest from a multi-dimensional data set; and secondly, how to deal with the
uncertainty in the data. The skyline query is one way of addressing the first challenge. It works by only returning tuples of data that are not dominated by any
other tuples. For example, when looking for the best hotels based on distance and
price, a non-dominated tuple is a hotel which has the lowest price and/or shortest
distance compared to all other hotels. This allows many tuples (hotels) which are
less interesting (hotels which are further away and higher priced) to be pruned from
the result set. In addition, uncertainty in data can occur in many real world sources,
for example, network data, sensor data, survey data, etc. There has been much work
on answering top-k queries [24] [51] [59] [37] [4] [57] [64], nearest neighbour queries
[10] [29] [11], skyline queries on uncertain data [6] [46] etc. It is challenging to obtain
accurate, reliable, and meaningful results from this data.
The above discussion motivates the need for answering skyline queries on uncertain data. Currently, existing work on answering probabilistic skyline queries
using the uncertain object model either require the user to set a threshold [46] or
returns the skyline probabilities of every instance [6]. Neither approach is desirable
6
- Xem thêm -