
BỘ GIÁO DỤC VÀ ĐÀO TẠO
BỘ QUỐC PHÒNG
VIỆN KHOA HỌC VÀ CÔNG NGHỆ QUÂN SỰ

LƯƠNG THẾ DŨNG

DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING
(Research on building solutions to ensure information security in the data mining process)

Doctoral Thesis in Mathematics

Hà Nội - 2011
BỘ GIÁO DỤC VÀ ĐÀO TẠO
BỘ QUỐC PHÒNG
VIỆN KHOA HỌC VÀ CÔNG NGHỆ QUÂN SỰ

LƯƠNG THẾ DŨNG

DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING
(Research on building solutions to ensure information security in the data mining process)

Specialty: Mathematical assurance for computers and computing systems
Code: 62 46 35 01

Doctoral Thesis in Mathematics

Scientific supervisors:
1. Prof. Dr.Sc. HỒ TÚ BẢO
2. Assoc. Prof. Dr. BẠCH NHẬT HỒNG

Hà Nội - 2011
Pledge

I declare that this thesis is a presentation of my original research work. Its content is based on reliable references, such as papers published in distinguished international conferences and journals and books from well-known publishers. The results and discussions of the thesis are new and have not previously been published by any other authors.
Contents

1 INTRODUCTION . . . . . 1
   1.1 Privacy-preserving data mining: An overview . . . . . 1
   1.2 Objectives and contributions . . . . . 5
   1.3 Related works . . . . . 7
   1.4 Organization of thesis . . . . . 12

2 METHODS FOR SECURE MULTI-PARTY COMPUTATION . . . . . 13
   2.1 Definitions . . . . . 13
      2.1.1 Computational indistinguishability . . . . . 13
      2.1.2 Secure multi-party computation . . . . . 14
   2.2 Secure computation . . . . . 15
      2.2.1 Secret sharing . . . . . 15
      2.2.2 Secure sum computation . . . . . 16
      2.2.3 Probabilistic public key cryptosystems . . . . . 17
      2.2.4 Variant ElGamal cryptosystem . . . . . 18
      2.2.5 Oblivious polynomial evaluation . . . . . 20
      2.2.6 Secure scalar product computation . . . . . 21
      2.2.7 Privately computing ln x . . . . . 22

3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING . . . . . 24
   3.1 Introduction . . . . . 24
   3.2 Privacy preserving frequency mining in 2PFD setting . . . . . 27
      3.2.1 Problem formulation . . . . . 27
      3.2.2 Definition of privacy . . . . . 29
      3.2.3 Frequency mining protocol . . . . . 30
      3.2.4 Correctness analysis . . . . . 32
      3.2.5 Privacy analysis . . . . . 34
      3.2.6 Efficiency of frequency mining protocol . . . . . 37
   3.3 Privacy preserving frequency-based learning in 2PFD setting . . . . . 38
      3.3.1 Naive Bayes learning problem in 2PFD setting . . . . . 38
      3.3.2 Naive Bayes learning protocol . . . . . 40
      3.3.3 Correctness and privacy analysis . . . . . 42
      3.3.4 Efficiency of naive Bayes learning protocol . . . . . 42
   3.4 An improvement of frequency mining protocol . . . . . 44
      3.4.1 Improved frequency mining protocol . . . . . 44
      3.4.2 Protocol analysis . . . . . 45
   3.5 Conclusion . . . . . 46

4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY ... . . . . . 49
   4.1 Introduction . . . . . 49
   4.2 Problem formulation . . . . . 51
      4.2.1 Association rules and frequent itemset . . . . . 51
      4.2.2 Frequent itemset identifying in vertically distributed data . . . . . 52
   4.3 Computational and privacy model . . . . . 53
   4.4 Support count preserving protocol . . . . . 54
      4.4.1 Overview . . . . . 54
      4.4.2 Protocol design . . . . . 56
      4.4.3 Correctness analysis . . . . . 57
      4.4.4 Privacy analysis . . . . . 59
      4.4.5 Performance analysis . . . . . 61
   4.5 Support count computation-based protocol . . . . . 64
      4.5.1 Overview . . . . . 64
      4.5.2 Protocol design . . . . . 65
      4.5.3 Correctness analysis . . . . . 65
      4.5.4 Privacy analysis . . . . . 67
      4.5.5 Performance analysis . . . . . 68
   4.6 Using binary tree communication structure . . . . . 69
   4.7 Privacy-preserving distributed Apriori algorithm . . . . . 70
   4.8 Conclusion . . . . . 71

5 PRIVACY PRESERVING CLUSTERING . . . . . 73
   5.1 Introduction . . . . . 73
   5.2 Problem statement . . . . . 74
   5.3 Privacy preserving clustering for the multi-party distributed data . . . . . 76
      5.3.1 Overview . . . . . 76
      5.3.2 Private multi-party mean computation . . . . . 78
      5.3.3 Privacy preserving multi-party clustering protocol . . . . . 80
   5.4 Privacy preserving clustering without disclosing cluster centers . . . . . 82
      5.4.1 Overview . . . . . 83
      5.4.2 Privacy preserving two-party clustering protocol . . . . . 85
      5.4.3 Secure mean sharing . . . . . 87
   5.5 Conclusion . . . . . 88

6 PRIVACY PRESERVING OUTLIER DETECTION . . . . . 91
   6.1 Introduction . . . . . 91
   6.2 Technical preliminaries . . . . . 92
      6.2.1 Problem statement . . . . . 92
      6.2.2 Linear transformation . . . . . 93
      6.2.3 Privacy model . . . . . 94
      6.2.4 Private matrix product sharing . . . . . 95
   6.3 Protocols for the horizontally distributed data . . . . . 95
      6.3.1 Two-party protocol . . . . . 97
      6.3.2 Multi-party protocol . . . . . 100
   6.4 Protocol for two-party vertically distributed data . . . . . 101
   6.5 Experiments . . . . . 104
   6.6 Conclusions . . . . . 106

SUMMARY . . . . . 107
Publication List . . . . . 110
Bibliography . . . . . 111
List of Phrases

Abbreviation   Full name
PPDM           Privacy Preserving Data Mining
k-NN           k-nearest neighbor
EM             Expectation-maximization
SMC            Secure Multiparty Computation
DDH            Decisional Diffie-Hellman
PMPS           Private Matrices Product Sharing
SSP            Secure Scalar Product
OPE            Oblivious polynomial evaluation
ICA            Independent Component Analysis
2PFD           2-part fully distributed setting
FD             fully distributed setting
≡c             computational indistinguishability
List of Tables

4.1 The communication cost . . . . . 62
4.2 The complexity of the support count preserving protocol . . . . . 63
4.3 The parties' time for the support count preserving protocol . . . . . 64
4.4 The communication cost . . . . . 68
4.5 The complexity of the support count computation protocol . . . . . 69
4.6 The parties' time for the support count computation protocol . . . . . 70
6.1 The parties' computational time for the horizontally distributed data . . . . . 105
6.2 The parties' computational time for the vertically distributed data . . . . . 105
List of Figures

3.1 Frequency mining protocol . . . . . 33
3.2 The time used by the miner for computing the frequency f . . . . . 38
3.3 Privacy preserving protocol of naive Bayes learning . . . . . 41
3.4 The computational time for the first phase and the third phase . . . . . 43
3.5 The time for computing the key values in the first phase . . . . . 43
3.6 The time for computing the frequency f in the third phase . . . . . 44
3.7 Improved frequency mining protocol . . . . . 47
4.1 Support count preserving protocol . . . . . 58
4.2 The support count computation protocol . . . . . 66
4.3 Privacy-preserving distributed Apriori protocol . . . . . 72
5.1 Privacy preserving multi-party mean computation . . . . . 79
5.2 Privacy preserving multi-party clustering protocol . . . . . 81
5.3 Privacy preserving two-party clustering . . . . . 86
5.4 Secure mean sharing . . . . . 89
6.1 Private matrix product sharing (PMPS) . . . . . 96
6.2 Protocol for two-party horizontally distributed data . . . . . 98
6.3 Protocol for multi-party horizontally distributed data . . . . . 101
6.4 Protocol for two-party vertically distributed data . . . . . 103
Chapter 1
INTRODUCTION
1.1.
Privacy-preserving data mining: An overview
Data mining plays an important role in today's world and provides a powerful tool for efficiently discovering valuable information in large databases [25]. However, the mining process can violate privacy, so privacy preservation in data mining is receiving more and more attention from the community [52]. As a result, a large number of studies have been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies address the problem of learning data mining models from databases while protecting data privacy at the level of individual records or at the level of organizations.
Basically, there are three major problems in PPDM [8]. First, organizations such as government agencies wish to publish their data for researchers and even for the community, but they want to preserve the privacy of the data, for example highly sensitive financial and health records. Second, a group of organizations (or parties) wishes to jointly obtain mining results on their combined data without disclosing any party's private information. Third, a miner wishes to collect data, or to obtain data mining models, from individual users while preserving the privacy of each user. Consequently, PPDM can be divided into the following three areas, depending on the model of information sharing.
Privacy-preserving data publishing: The model in this line of research consists of a single organization, the trusted data holder, which wishes to publish its data to the miner or to the research community such that the anonymized data remain useful for data mining applications. For example, hospitals collect records from their patients in order to provide required medical services. These hospitals can be the trusted data holders; however, the patients may not trust them when the data are sent on to the miner.
When anonymized data are published, there is much evidence that the published data can still cause privacy breaches through certain attacks. One of them, called re-identification, was demonstrated in [61]: 87% of the American population have characteristics that identify them uniquely from a few public attributes, namely zip code, date of birth, and sex. Consequently, privacy-preserving data publishing has received much attention in recent years; it aims to prevent re-identification attacks while preserving the information in the released data that is useful for data mining applications.
A general technique is k-anonymity [51], [82], [6], whose goal is to protect released data against the possibility of re-identification. Consider a private data table from which the explicit identifiers (e.g., SSN and Name) have been removed. The values of other released attributes, such as ZIP, date of birth, marital status, and sex, can also appear in external sources and may still be joined with individual users' identities. If some combinations of values for these attributes occur uniquely or rarely, then by observing the data one can determine the identity of a user, or deduce a small candidate set that contains the user. The goal of k-anonymity is that every tuple in the released private table be indistinguishable from at least k − 1 other tuples.
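As a toy illustration of this property, the following Python sketch (hypothetical attribute names and data, not code from the thesis) computes the largest k for which a released table is k-anonymous with respect to a chosen set of quasi-identifier attributes:

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Largest k such that the table is k-anonymous: every combination
    of quasi-identifier values is shared by at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(groups.values())

# Explicit identifiers (SSN, Name) already removed; ZIP and Sex remain.
table = [
    {"zip": "10000", "sex": "F", "diagnosis": "flu"},
    {"zip": "10000", "sex": "F", "diagnosis": "cold"},
    {"zip": "20000", "sex": "M", "diagnosis": "flu"},
]

# The combination ("20000", "M") occurs only once, so that record is
# re-identifiable: the table is only 1-anonymous.
print(anonymity_level(table, ["zip", "sex"]))  # 1
```

Generalization or suppression of quasi-identifier values would be applied until this level reaches the desired k.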
Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms that do not access the original data [33, 79, 35, 68, 80, 40]. Unlike privacy-preserving data publishing, each study in this area typically solves a specific data mining task. The model usually consists of several parties, each holding one private data set. The general purpose is to enable the parties to mine cooperatively on their joint data sets without revealing private information to the other participating parties. The way the data are distributed across the parties also plays an important role in the problem being solved. Generally, data can be distributed into many parts either vertically or horizontally.
In a horizontal distribution, a data set is divided among several parties, and every party holds data records with the same set of attributes. An example is the union of the customer databases of different banks. Banks typically offer their clients different services, such as savings accounts, credit cards, and stock investments. Suppose the banks wish to predict which customers are safe, which are risky, and which may be frauds. Gathering all available financial data about their customers and transactions can help them make these predictions, and thus prevent large financial losses; suitable mining techniques applied to the gathered data can generalize over these data sets and identify possible risks for future cases or transactions. More specifically, suppose a customer, Nam, goes to Bank A to apply for a car loan. He must provide the necessary information to Bank A, and an expert system of Bank A can use a k-NN algorithm to classify Nam as either a risky or a safe customer. If this system uses only Bank A's own database, it may happen that Bank A does not have enough customers similar to Nam, so the system may produce a wrong classification: for example, Nam is a safe customer but the system labels him risky, and Bank A loses the profit from the loan. Clearly, mining a larger database can give a more accurate classification, so classifying against the joint databases of Bank A and other banks might give a more accurate result, and Nam could have been classified as safe. The problem, however, is that privacy restrictions do not allow the banks to access each other's databases; privacy-preserving distributed data mining addresses exactly this problem. This scenario is a typical case of so-called horizontally partitioned data: the banks do not need to reveal their databases to each other, yet they can still apply k-NN classification to their joint databases while preserving each bank's private information.
In a vertical distribution, a data set is divided among several parties, and every party owns a vertical part of every record in the database, i.e., it holds the values of a subset of the attributes. For example, financial transaction information is collected by banks, while tax information is collected by the IRS. In [71], Jaideep Vaidya et al. give another illustrative example of two vertically distributed databases: one contains people's medical records while the other contains cell phone information for the same set of people. Mining the joint global database might yield patterns such as "cell phones with Li/Ion batteries lead to brain tumors in diabetics."
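The two distribution models can be sketched concretely as follows (toy data and attribute names invented for illustration):

```python
# A toy joint table: (person_id, balance, tax_paid).
records = [
    ("nam",  1200, 300),
    ("lan",  3400, 810),
    ("hoa",  2100, 520),
    ("minh",  900, 190),
]

# Horizontal partition: each party holds complete records for a
# disjoint subset of individuals, over the same attribute set.
bank_a, bank_b = records[:2], records[2:]

# Vertical partition: each party holds one attribute for *every*
# individual; records are linkable through the shared identifier.
bank = {pid: bal for pid, bal, _ in records}   # the bank sees balances
irs  = {pid: tax for pid, _, tax in records}   # the tax agency sees taxes

# Either way, the full table exists only as the join of the parts,
# which is exactly what privacy restrictions forbid computing in the clear.
joined = [(pid, bank[pid], irs[pid]) for pid in bank]
assert sorted(joined) == sorted(bank_a + bank_b)
```

The protocols in later chapters aim to compute mining results over this implicit join without ever materializing it at any single party.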
Privacy-preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users in order to learn data mining results from their data, or simply to collect their data, while the sensitive attributes of the users must be protected [74, 77, 19]. In this scenario each user holds only one data record, so the data can be thought of as a horizontally partitioned database in which each transaction is owned by a different user; this is also called the fully distributed (FD) setting. Unlike privacy-preserving data publishing, the miner differs from the publisher in that he is untrusted: he could be an attacker who attempts to identify sensitive information in the user data. For example, Du et al. [19] studied how to build a decision tree on private data. In their study, a miner wants to collect data from users to form a central database and then mine this database, so he distributes a survey; each user is required to answer its questions and send back the answers. However, the survey contains some sensitive questions, and users may not feel comfortable disclosing their answers. The problem, then, is how the miner can obtain the mining model without learning sensitive information about the users. One requirement on methods in this area is that there be no interaction among users; each user communicates only with the data miner. Even so, we must ensure that nothing about the sensitive data beyond the desired results is revealed to the miner.
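A classical way to see how a miner can learn aggregate answers to a sensitive yes/no question without learning any individual's answer is Warner-style randomized response. The sketch below is a generic illustration of that idea, not the protocol of [19]:

```python
import random

def randomize(answer: bool, p: float) -> bool:
    """Each user reports the true answer with probability p and the
    opposite answer otherwise; the miner never sees the true value."""
    return answer if random.random() < p else not answer

def estimate_frequency(reports, p):
    """E[observed] = p*f + (1-p)*(1-f), so the miner inverts:
    f = (observed - (1 - p)) / (2p - 1)."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p)) / (2 * p - 1)

random.seed(0)
true_answers = [True] * 300 + [False] * 700   # true frequency: 0.30
reports = [randomize(a, p=0.8) for a in true_answers]
print(estimate_frequency(reports, p=0.8))     # close to 0.30
```

Any single report is deniable, yet the aggregate frequency is recoverable; the cryptographic protocols of Chapter 3 achieve the stronger goal of exact frequencies with no accuracy loss.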
1.2.
Objectives and contributions
Many solutions are now available for the issues in PPDM. The quality of each solution is evaluated by three basic characteristics: privacy degree, accuracy, and efficiency. The problem is that each solution applies only to a particular distributed scenario or to a concrete data mining algorithm. Some can be applied to more than one scenario or algorithm, but their accuracy then falls below acceptable requirements; other solutions achieve accuracy but offer poor privacy. In addition, PPDM solutions are clearly lacking both for various practical contexts and for well-known data mining techniques. In this thesis, we aim at solving the following issues in PPDM:
1. To introduce a new scenario for privacy-preserving user data mining and to find a good privacy solution for a family of frequency-based learning algorithms in this scenario.
2. To develop novel privacy-preserving techniques for popular data mining algorithms such as association rule mining and clustering methods.
3. To present a technique for designing protocols for privacy-preserving multivariate outlier detection in both horizontally and vertically distributed data models.
The developed solutions are evaluated in terms of their degree of privacy protection, correctness, usability in real-life applications, efficiency, and scalability.
The contributions of this thesis are solutions for four problems in PPDM. Each problem has a statement independent of the others, but they share a common form: given a data set distributed among several parties (or users), the task is to mine knowledge from the parties' joint data while preserving the privacy of each party. The problems differ in their distributed data models (scenarios) and in the functions proposed to keep the parties' information private. In summary, our contributions are as follows.
• In the first work (Chapter 3), we propose a new scenario for privacy-preserving user data mining, called the 2-part fully distributed (2PFD) setting, and find a solution for a family of frequency-based learning algorithms in this setting. In 2PFD, the data set is distributed across a large number of users, and each record is owned by two different users: one user knows the values of a subset of the attributes, and the other knows the values of the remaining attributes. A miner aims to learn, for example, classification rules on the users' data while preserving each user's privacy. We develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step is the privacy-preserving computation of the frequencies of tuples of values in the users' data, which ensures each user's privacy without loss of accuracy. We illustrate the applicability of the method by using it to build a privacy-preserving protocol for naive Bayes classifier learning, and we briefly address its use in other applications. Experimental results show that our protocol is efficient.
• The second contribution (Chapter 4) is a set of novel protocols for privacy-preserving frequent itemset mining in vertically distributed data. These protocols allow a group of parties to cooperatively mine frequent itemsets in a distributed setting without revealing each party's portion of the data to the others. Their key security property improves on previous protocols in that we achieve full privacy protection for each party: the property requires no trusted party, and no collusion of parties can cause a privacy breach.
• In the third work (Chapter 5), we present expectation-maximization (EM) mixture model clustering methods for distributed data that preserve the privacy of the participating parties' data. First, a privacy-preserving EM-based clustering method for multi-party distributed data is proposed. Unlike the existing method, ours does not reveal the sums of numerators and denominators in the secure computation of the parameters of the EM algorithm; it is therefore more secure, and it allows an arbitrary number of participating parties. Second, we propose a better method for the case in which the data set is horizontally partitioned into only two parts; this method computes covariance matrices and final results without revealing the private information or the means. To this end, we present a protocol based on oblivious polynomial evaluation and the secure scalar product that addresses the computation of the means, the covariance matrices, and the posterior probabilities. This approach allows two or more parties to cooperatively cluster their joint data sets without disclosing any party's private data to the others.
• In the fourth work (Chapter 6), we study the setting where several parties, each with a private data set, want to conduct outlier detection on their joint data set, but none of them wants to disclose its private data to the other parties. We propose a linear transformation technique for designing secure multivariate outlier detection protocols in both horizontally and vertically distributed data models. Unlike most previous privacy-preserving techniques, which target distance-based outlier detection, our focus is on non-distance-based techniques for detecting outliers in statistics.
1.3.
Related works
Recently, many solutions have been proposed for PPDM. These solutions can be categorized into two main approaches: the secure multiparty computation (SMC) approach and the randomization approach.
The basic idea of the randomization approach is to perturb the original (private) data set and release the result for data analysis. The perturbation must ensure that the original individual data values cannot be recovered while preserving the statistical utility of the data, so that patterns in the original data can still be mined. There are two main perturbation techniques: random transformation and randomization. The first transforms each data value (record) into a random value (record) from the same domain as the original data in ways that preserve certain statistics but hide the real values [21, 4, 19, 3]. The second adds noise to the data to prevent discovery of the real values; it must ensure that, given the distribution of the added noise and the randomized data set, the distribution of the original data set (but not the actual data values) can be reconstructed [1, 36, 16].
A typical example is the algorithm of Agrawal and Srikant [2], in which the values of an attribute are discretized into intervals and each original value is assigned to an interval. The original data distribution is then reconstructed by a Bayesian approach, and decision trees can be induced from the reconstructed distribution. Many other distribution reconstruction methods have also been introduced. In [1], Agrawal et al. developed an approach based on expectation maximization that also gave a better definition of privacy and an improved algorithm. Evfimievski et al. [21] used a similar technique for association rule mining, and Polat et al. proposed a privacy-preserving collaborative filtering method using randomized techniques [53].
Although perturbation techniques are very efficient, their use generally involves a tradeoff between privacy and accuracy: the more privacy we require, the more accuracy the miner loses in the data mining results, and vice versa. Even techniques that allow us to reconstruct distributions reveal some information about the original data values. Consider, for example, randomizing an age attribute when it is known that no driver is under 18, and suppose randomization is implemented by adding noise chosen uniformly from the range [-10, 10]. The reconstructed distribution does not show any true age value; a randomized age of 40, say, only tells us that the true age lies in the range [30, 50]. However, if an age value in the randomized set is 8, the true age must lie in [-2, 18], and since no driver is under 18, the driver whose age is given as 8 must be exactly 18 in the original data. Some work has therefore been done to measure the privacy of randomization techniques, so that they can be used carefully enough to obtain the desired privacy. Kargupta et al. [36] formally analyzed the privacy of randomization techniques and showed that in many cases they reveal private information. Evfimievski et al. [20] showed how to limit privacy breaches when using the randomization technique for privacy-preserving data mining.
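The leak in the age example can be made explicit in a few lines: combining the public noise range with domain knowledge pins boundary values down exactly. A minimal sketch, using a [-10, 10] uniform noise range and a minimum driving age of 18:

```python
NOISE = 10       # noise is drawn uniformly from [-NOISE, NOISE]
MIN_AGE = 18     # domain knowledge: no driver is under 18

def inferred_age_interval(randomized_age):
    """Interval an attacker can deduce for the true age, given one
    randomized value and the public noise range."""
    lo = max(MIN_AGE, randomized_age - NOISE)
    hi = randomized_age + NOISE
    return lo, hi

print(inferred_age_interval(40))  # (30, 50): 21 candidate ages remain
print(inferred_age_interval(8))   # (18, 18): the true age is fully exposed
```

Values in the interior of the domain stay well hidden, while values whose noisy image lands near the domain boundary can be recovered exactly; this is the asymmetry the privacy measures cited above are designed to quantify.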
Many privacy-preserving data mining algorithms based on SMC have been proposed as well. They can be described as computational processes in which a group of parties computes a function of their private inputs, while no party wants to disclose its own input to any other party. The secure multiparty computation framework was developed by Goldreich [24]. In this framework, multiparty protocols fall into either the semi-honest model or the malicious adversary model. In the semi-honest model, the parties are assumed to follow the protocol rules, but after the execution of the protocol has completed they may still try to learn additional information by analyzing the messages received during the execution. In the malicious adversary model, the parties may execute arbitrary operations to damage the other parties, so protocol design in this model is much more difficult than in the semi-honest model. Currently, the semi-honest model is the one usually used in the context of privacy-preserving data mining. The formal definitions of SMC are stated in [24].
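The secure sum primitive (Section 2.2.2) is perhaps the simplest example of a semi-honest protocol. The sketch below simulates the standard ring-based version in one process; in a real deployment each addition would be performed by a different party:

```python
import random

def secure_sum(private_values, modulus=1_000_003):
    """Ring-based secure sum, semi-honest model.  The initiating party
    masks its input with a random value R; every other party adds its
    own input to the running total mod m, seeing only a uniformly
    random-looking partial sum; finally the initiator removes R."""
    r = random.randrange(modulus)
    total = (r + private_values[0]) % modulus   # party 1 sends masked value
    for v in private_values[1:]:
        total = (total + v) % modulus           # parties 2..n add their inputs
    return (total - r) % modulus                # party 1 unmasks the result

print(secure_sum([12, 7, 30]))  # 49
```

No single party learns another's input, but the two neighbours of a party can collude, subtract their views, and recover that party's value; this is the kind of collusion weakness that motivates the stronger protocols developed later in the thesis.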
The secure multi-party computation problem was first proposed by Yao [78], who gave a method to solve Yao's Millionaires problem: two millionaires compare their wealth without either revealing any private information to the other. By the theoretical results of Goldreich, the general SMC problem can be solved by the circuit evaluation method; however, this solution is impractical in terms of efficiency, so finding efficient problem-specific solutions is seen as an important research direction. In recent years, many specific solutions have been introduced in different research areas such as information retrieval, computational geometry, and statistical analysis [17, 10, 70].
Randomization approaches [21, 4, 19, 3, 1, 36, 16] can be used in the fully distributed scenario, where a data miner wants to obtain classification models from the data of a large set of users: the users simply randomize their data and submit the randomized data to the miner, who can later reconstruct some useful information. To obtain strong privacy without loss of accuracy, however, SMC techniques have been proposed in [74, 77]. The key idea of these techniques is a private frequency computation method that allows a data miner to compute the frequencies of values or tuples in the FD setting while preserving the privacy of each user's data. In Chapter 3, we propose an SMC solution that allows the miner to learn frequency-based models in the 2PFD setting. Note that in this setting each user may know only some values of a tuple, not all of them; therefore the cryptographic approaches mentioned above cannot be used in the 2PFD setting. For the FD setting, other solutions based on k-anonymization of user data have been proposed in [83, 77]. Their advantage is that they do not depend on the underlying data mining task, because the anonymized data can be used for various data mining tasks without disclosing privacy. However, these solutions are inapplicable in the 2PFD setting, because the miner cannot link the two anonymous parts of one object with each other.
SMC approaches are also commonly used for privacy-preserving distributed data mining, where the data are distributed across several parties. The privacy property of such algorithms is quantified by the privacy definition of SMC: each party involved in a privacy-preserving distributed protocol is allowed to learn only the desired data mining models and nothing else. Generally, each protocol must be designed for a specific task for reasons of efficiency and privacy. Specific privacy-preserving distributed protocols have been proposed for different data mining problems over distributed databases. For example, [70, 63, 18] developed privacy-preserving classification protocols for vertically distributed data based on a secure scalar product method, including privacy-preserving protocols for learning naive Bayes classifiers, association rules, and decision trees. In [34], privacy-preserving naive Bayes classification was addressed for horizontally distributed data by computing the secure sum of the local frequencies of all participating parties. Our work in Chapter 4 presents frequent itemset mining protocols for vertically partitioned data. Distributed association rule/itemset mining has been addressed for both vertically and horizontally partitioned data [33, 79, 35, 68, 80]; however, to the best of our knowledge, these protocols preserve the privacy of each party only against the collusion of at most n − 2 corrupted parties. Our protocols for privacy-preserving frequent itemset mining with multiple parties protect the privacy of each party against the collusion of up to n − 1 corrupted parties.
Privacy-preserving distributed clustering has also been studied by many authors. In [49] and [47], the authors focused on different transformation techniques that enable the data owner to share the data with another party, who then clusters it. Clifton and Vaidya proposed a secure multiparty computation of the k-means algorithm on vertically partitioned data [66]. In [29], the authors proposed a solution for privacy-preserving clustering on horizontally partitioned data, primarily focusing on hierarchical clustering methods that can both discover clusters of arbitrary shapes and deal with different data types. In [59], Kruger et al. proposed a privacy-preserving distributed k-means protocol on horizontally partitioned data whose key step is the privacy-preserving computation of cluster means: at each iteration of the algorithm, only the means are revealed to the parties and nothing else. Revealing the means, however, might still allow the parties to learn some extra information about each other. To our knowledge, there is so far only one secure method for the expectation-maximization (EM) mixture model over horizontally distributed sources [40], based on secure sum computation. However, this method requires at least three participating parties: because the global model is a sum of the local models, in the case of only two parties, which often happens in practice, each party could compute the other party's local model by subtracting its own local model from the global model. The aim of our work in Chapter 5 is, firstly, to develop a more general protocol which allows the number of par-
