Tài liệu Distributed solutions in privacy preserving data mining

.PDF

130

293

nhattuvisu Báo vi phạm

Tải xuống 71

Mô tả:

BỘ GIÁO , mn DỤC VÀ ĐÀO TẠO BỘ QUỐC PHÒNG VIỆN KHOA HỌC VÀ CÔNG NGHỆ QUÂN SỰ ---------------FG--------------- LƯƠNG THẾ DŨNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cứu xây dựng một số giải pháp đảm bảo an toàn thông tin trong quá trình khai phá dữ liệu) LUẬN ÁN TIẾN SĨ TOÁN HỌC Hà Nội - 2011 BỘ GIÁO DỤC VÀ ĐÀO TẠO BỘ QUỐC PHÒNG VIỆN KHOA HỌC VÀ CÔNG NGHỆ QUÂN SỰ ---------------FG--------------- LƯƠNG THẾ DŨNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cứu xây dựng một số giải pháp đảm bảo an toàn thông tin trong quá trình khai phá dữ liệu) Chuyên ngành: Bảo đảm toán học cho máy tính và hệ thống tính toán. Mã số : 62 46 35 01 LUẬN ÁN TIẾN SĨ TOÁN HỌC Người hướng dẫn khoa học: 1. GIÁO SƯ - TIẾN SĨ KHOA HỌC HỒ TÚ BẢO 2. PHÓ GIÁO SƯ - TIẾN SĨ BẠCH NHẬT HỒNG Hà Nội - 2011 Pledge I promise that this thesis is a presentation of my original research work. Any of the content was written based on the reliable references such as published papers in distinguished international conferences and journals, and books published by widely-known publishers. Results and discussions of the thesis are new, not previously published by any other authors. i Contents 1 2 INTRODUCTION 1 1.1 Privacy-preserving data mining: An overview . . . . . . . . . 1 1.2 Objectives and contributions . . . . . . . . . . . . . . . . . . 5 1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . 12 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13 2.1 Deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.1 Computational indistinguishability . . . . . . . . . . . 13 2.1.2 Secure multi-party computation . . . . . . . . . . . . . 14 2.2 3 Secure computation . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Secret sharing . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Secure sum computation . . . . . . . . . . . . . . . . . 16 2.2.3 Probabilistic public key cryptosystems . . . . . . . . . 17 2.2.4 Variant ElGamal Cryptosystem . . . . . . . . . . . . . 18 2.2.5 Oblivious polynomial evaluation . . . . . . . . . . . . 20 2.2.6 Secure scalar product computation . . . . . . . . . . . 21 2.2.7 Privately computing ln x . . . . . . . . . . . . . . . . . 22 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Privacy preserving frequency mining in 2PFD setting . . . . . 27 3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . 27 3.2.2 Deﬁnition of privacy . . . . . . . . . . . . . . . . . . . 29 3.2.3 Frequency mining protocol . . . . . . . . . . . . . . . 30 ii 3.3 3.4 3.5 4 3.2.4 Correctness Analysis . . . . . . . . . . . . . . . . . . . 32 3.2.5 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 34 3.2.6 Eﬃciency of frequency mining protocol 37 . . . . . . . . Privacy Preserving Frequency-based Learning in 2PFD Setting 38 3.3.1 Naive Bayes learning problem in 2PFD setting . . . . 38 3.3.2 Naive Bayes learning Protocol . . . . . . . . . . . . . . 40 3.3.3 Correctness and privacy analysis . . . . . . . . . . . . 42 3.3.4 Eﬃciency of naive Bayes learning protocol . . . . . . . 42 An improvement of frequency mining protocol . . . . . . . . . 44 3.4.1 Improved frequency mining protocol . . . . . . . . . . 44 3.4.2 Protocol Analysis . . . . . . . . . . . . . . . . . . . . . 45 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY ... 49 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Association rules and frequent itemset . . . . . . . . . 51 4.2.2 Frequent itmeset identifying in vertically distributed data 52 4.3 Computational and privacy model . . . . . . . . . . . . . . . 53 4.4 Support count preserving protocol . . . . . . . . . . . . . . . 54 4.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.2 Protocol design . . . . . . . . . . . . . . . . . . . . . . 56 4.4.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 57 4.4.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 59 4.4.5 Performance analysis . . . . . . . . . . . . . . . . . . . 61 Support count computation-based protocol . . . . . . . . . . 64 4.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5.2 Protocol Design . . . . . . . . . . . . . . . . . . . . . . 65 4.5.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 65 4.5.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 67 4.5.5 Performance analysis . . . . . . . . . . . . . . . . . . . 68 Using binary tree communication structure . . . . . . . . . . 69 4.5 4.6 iii 5 4.7 Privacy-preserving distributed Apriori algorithm . . . . . . . 70 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 PRIVACY PRESERVING CLUSTERING 73 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3 Privacy preserving clustering for the multi-party distributed data 76 5.4 5.5 6 5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.3.2 Private multi-party mean computation . . . . . . . . . 78 5.3.3 Privacy preserving multi-party clustering protocol . . 80 Privacy preserving clustering without disclosing cluster centers 82 5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.4.2 Privacy preserving two-party clustering protocol . . . 85 5.4.3 Secure mean sharing . . . . . . . . . . . . . . . . . . . 87 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 PRIVACY PRESERVING OUTLIER DETECTION 91 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.2 Technical preliminaries . . . . . . . . . . . . . . . . . . . . . . 92 6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . 92 6.2.2 Linear transformation . . . . . . . . . . . . . . . . . . 93 6.2.3 Privacy model . . . . . . . . . . . . . . . . . . . . . . 94 6.2.4 Private matrix product sharing . . . . . . . . . . . . . 95 Protocols for the horizontally distributed data . . . . . . . . . 95 6.3.1 Two-party protocol . . . . . . . . . . . . . . . . . . . . 97 6.3.2 Multi-party protocol . . . . . . . . . . . . . . . . . . . 100 6.3 6.4 Protocol for two-party vertically distributed data . . . . . . . 101 6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 SUMMARY 107 Publication List 110 Bibliography 111 iv List of Phrases Abbreviation Full name PPDM Privacy Preserving Data Mining k-NN k-nearest neighbor EM Expectation-maximization SMC Secure Multiparty Computation DDH Decisional Diﬃe-Hellman PMPS Private Matrices Product Sharing SSP Secure Scalar Product OPE Oblivious polynomial evaluation ICA Independent Component Analysis 2PFD 2-part fully distributed setting FD fully distributed setting c ≡ computational indistinguishability v List of Tables 4.1 The communication cost . . . . . . . . . . . . . . . . . . . . 62 4.2 The complexity of the support count preserving protocol . . . 63 4.3 The parties’s time for the support count preserving protocol 64 4.4 The communication cost . . . . . . . . . . . . . . . . . . . . 68 4.5 The complexity of the support count computation protocol . 69 4.6 The parties’s time for the support count computation protocol 70 6.1 The parties’s computational time for the horizontally distributed data 105 6.2 The parties’s computational time for the vertically distributed data 105 vi List of Figures 3.1 Frequency mining protocol . . . . . . . . . . . . . . . . . . . . 33 3.2 The time used by the miner for computing the frequency f . 38 3.3 Privacy preserving protocol of naive Bayes learning . . . . . . 41 3.4 The computational time for the ﬁrst phase and the third phrase 43 3.5 The time for computing the key values in the ﬁrst phase . . 43 3.6 The time for computing the frequency f in third phrase . . . 44 3.7 Improved frequency mining protocol . . . . . . . . . . . . . . 47 4.1 Support count preserving protocol. . . . . . . . . . . . . . . . 58 4.2 The support count computation protocol. . . . . . . . . . . . 66 4.3 Privacy-preserving distributed Apriori protocol . . . . . . . . 72 5.1 Privacy preserving multi-party mean computation . . . . . . 79 5.2 Privacy preserving multi-party clustering protocol . . . . . . 81 5.3 Privacy preserving two-party clustering . . . . . . . . . . . . 86 5.4 Secure mean sharing . . . . . . . . . . . . . . . . . . . . . . . 89 6.1 Private matrix product sharing (PMPS). . . . . . . . . . . . . 96 6.2 Protocol for two-party horizontally distributed data. . . . . . 98 6.3 Protocol for multi-party horizontally distributed data. . . . . 101 6.4 Protocol for two-party vertically distributed data. . . . . . . . 103 vii Chapter 1 INTRODUCTION 1.1. Privacy-preserving data mining: An overview Data mining plays an important role in the current world and provides us a powerful tool to eﬃciently discover valuable information from large databases [25]. However, the process of mining data can result in a violation of privacy, therefore, issues of privacy preservation in data mining are receiving more and more attention from the this community [52]. As a result, there are a large number of studies has been produced on the topic of privacy-preserving data mining (PPDM) [72]. These studies deal with the problem of learning data mining models from the databases, while protecting data privacy at the level of individual records or the level of organizations. Basically, there are three major problems in PPDM [8]. First, the organizations such as government agencies wish to publish their data for researchers and even community. However, they want to preserve the data privacy, for example, highly sensitive ﬁnancial and health private data. Second, a group of the organizations (or parties) wishes to together obtain the mining result on their joint data without disclosing each party’s privacy information. Third, a miner wishes to collect data or obtain the data mining models from the individual users, while preserving privacy of each user. Consequently, PPDM can be formed into three following areas depending on the models of information sharing. Privacy-preserving data publishing: The model of this research consists of only an organization, is the trusted data holder. This organization wishes to publish its data to the miner or the research community such that the anonymized data are useful for the data mining applications. For example, some hospitals collect records from their patients for the some required 1 medical services. These hospital can be the trusted data holder, however the patients may not trust the hospital when they send their data to the miner. When we publish our anonymized data there are many evidences show that the publishing data may make privacy breaches via some attacks. One of them is called re-identiﬁcation and as showed in [61]. For example, there are 87% of the American population has characteristics that allows identifying them uniquely based on several public attributes, namely zip code, date of birth, and sex. Consequently, privacy preserving data publishing has received many attentions in recent years that aims to prevent re-identiﬁcation attack, while preserving the useful information for data mining applications from the released data. The general technique called k-anonymity [51], [82],[6], the goal here is to make the property that obtains the protection of released data to ﬁght against re-identiﬁcation possibility of the released data. Consider a private data table, where the data have been removed explicit identiﬁers (e.g., SSN and Name). However, values of other released attributes, such as ZIP, Date of birth, Marital status, and Sex can also appear in some external sources that may still joint with the individual users’ identities. If some combinations of values for these attributes occur uniquely or rarely, then from observing the data one can determine the identity of the user or deduce a limited set that consists of user. The goal of k-anonymity is that every tuple in the released private table is indistinguishability from at least other k users. Privacy-preserving distributed data mining: This research area aims to develop distributed data mining algorithms without accessing original data [33, 79, 35, 68, 80, 40]. Diﬀerent from privacy preserving data publishing, each study in privacy-preserving distributed data mining is often to solve a speciﬁc data mining task. The model of this area usually consists of several parties instead, each party has one private data set. The general purpose is to enable the parties for mining cooperatively on their joint data sets without revealing private information to other participating parties. Here, the way the data is distributed on parties also plays an important role in the solved problem. Generally, data could be distributed into many 2 parts either vertically or horizontally. In horizontally distribution, a data set is distributed into several parties. Every party has data records with the same set of attributes. For example, the customer databases union of the diﬀerent banks. Typically, banks have diﬀerent services for their clients as a savings account, choice of credit card, stock investments, etc. Assuming that banks wish to predict who are safe customers, who may be risk ones and may be frauds. Gathering all kinds of ﬁnancial data about their customers and their transactions can help them in the above predictions, thus they prevent huge ﬁnancial losses. Using reasonable techniques for mining on the gathered data can generalize over these datasets and identify possible risks for future cases or transactions. More speciﬁcally, when a customer Nam goes to Bank A to apply for a loan to buy a car. He needs to provide the necessary information to the Bank A. An expert system of Bank A can use k-NN algorithm to classify Nam as either a risk or safe customer. If this system only uses the database of the bank A, it could happen that Bank A does not have enough customers that are close to Nam. Therefore, the system may produce a wrong classiﬁcation. For example, Nam is one safe customer but the system recognized him as a risk customer. Consequently, the Bank A has a loss of proﬁt. It is clear that the mining on the larger database can result in the more accurate classiﬁcation. Thus, the classiﬁcation on the joint databases of bank A and other banks might give a more accurate result and Nam could have been classiﬁed as safe. However, the problem is that privacy restrictions would not allow the banks to access to each other’s databases. So, privacy-preserving distributed data mining can help this problem. This scenario is a typical case of socalled horizontally partitioned data. In the context of privacy-preserving data mining, banks do not need to reveal their databases to each other. They can still apply k-NN classiﬁcation to the joint databases of banks while preserving each bank’s privacy information. In vertically distribution, a data set is distributed into some parties. Every party owns a vertical part of every record in the database (it holds records for a subset of the attributes). For example, ﬁnancial transaction 3 information is collected by banks, while the tax information for everyone is collected by the IRS. In [71] Jaideep Vaidya et al. show an illustrative example of two vertical distributed databases as well, one contains medical records of people while another contains cell phone information for the same set of people. Mining the joint global database might obtain information like Cell phones with Li/Ion batteries lead to brain tumors in diabetics. Privacy-preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn some data mining results based on the user data or collects the user data while the sensitive attributes of these users need to be protected [74, 77, 19]. In this scenario, each user only maintains a data record. This can be thought of a horizontally partitioned database in which each transaction is owned by a diﬀerent user, and it is called fully distributed setting (FD) as well. Unlike privacy-preserving data publishing, the miner is diﬀerent from the publisher that he is un-trusted, thus he could be an attacker who attempts to identify some sensitive information from the user data. For example, Du et al. [19] studied to build decision tree on private data. In this study, a miner wants to collect data from users, and form a central database, then to conduct data mining on this database. Thus, he gives a survey containing some questions, each user is requited to answer those questions and sends back the answers. However, the survey contains some sensitive questions, and user may not feels comfortable to disclose their answers. Thus, the problem is that how could the miner obtain the mining model without learning sensitive information about the users. One of requirement for the methods in this area is that there are not any interactions between users, and each user only communicate to the data miner. However, we are still able to ensure that nothing about the sensitive data beyond the desired results is revealed to the data miner. 4 1.2. Objectives and contributions Up to now, there are many available solutions for solving the issues in PPDM. The quality of each solution is evaluated based on the three basic characteristics: privacy degree, accuracy, and eﬃciency. But the problem here is that each solution was only used in a particular distributed scenario or in a concrete data mining algorithm. Although some of them can be applied for more than one scenario or algorithm but their accuracy is lower than acceptable requirement. Other solutions reach the accuracy, however, their privacy is poor. In addition, it is easily to see that the lack of PPDM solutions for various practical context as well as well-known data mining techniques. In this thesis, we aim at solving some issues in PPDM as follows: 1. To introduce a new scenario for privacy-preserving user data mining and ﬁnd a good privacy solution for a family of frequency-based learning algorithms in this scenario. 2. To develop novel privacy-preserving techniques for popular data mining algorithms such as association rule mining and clustering methods. 3. To present a technique to design protocols for privacy- preserving multivariate outlier detection in both horizontally and vertically distributed data models. The developed solutions will be evaluated in terms of the degree of privacy protection, correctness, usability to the real life applications, eﬃciency and scalability. The contributions of this thesis is to provide solutions for four problems in PPDM. Each problem has an independent statement to the others, but they share a common interpretation that given a data set being distributed into several parties (or users), our task is to mine knowledge from all parties’ joint data while preserving privacy of each party. The diﬀerence among those problems lies in various distributed data models (scenarios), and various proposed functions to keep the privacy information for parties. Summarizing, our contributions in this thesis are as follows. 5 • The ﬁrst work (Chapter 3), we propose a new scenario for privacypreserving user data mining called 2-part fully distributed setting (2PFD) and ﬁnd solution for a family of frequency-based learning algorithms in 2PFD setting. In 2PFD, the dataset is distributed across a large number of users in which each record is owned by two diﬀerent users, one user only knows the values for a subset of attributes and the other knows the values for the remaining attributes. A miner aims to learn, for example, classiﬁcation rules on their data, while preserving each users privacy. In this work we develop a cryptographic solution for frequency-based learning methods in 2PFD. The crucial step in the proposed solution is the privacy-preserving computation of frequencies of a tuple of values in the users data, which can ensure each users privacy without loss of accuracy. We illustrate the applicability of the method by using it to build the privacy preserving protocol for the naive Bayes classiﬁer learning, and brieﬂy address the solution in other applications. Experimental results show that our protocol is eﬃcient • The second contribution of this thesis (Chapter 4) is the novel protocols for privacy-preserving frequent itemset mining in vertically distributed data. These protocols allow a group of parties cooperatively mine frequent itemsets in distributed setting without revealing each party’s portion of the data to the other. The important security property of our protocols is better than the previous protocols’ one in the way that we achieve the full privacy protection for each party. This property does not require the existence of any of trusted parties. In addition, no collusion of parties can make privacy breaches. • For third work (Chapter 5), we present the expectation maximization mixture model clustering method for distributed data that preserves privacy for data of participating parties. Firstly, privacy preserving EM-based clustering method for multi-party distributed data proposed. Unlike the existing method, our method does not reveal sum results of numerator and denominator in the secure computation for the pa6 rameters of EM algorithm, therefore, the proposed method is more secure and it allows the number of participating parties to be arbitrary. Secondly, we propose the better method for the case in which the dataset is horizontally partitioned into only two parts, this method allows computing covariance matrices and ﬁnal results without revealing the private information and the means. To solve this one, we have presented a protocol based on the oblivious polynomial evaluation and the secure scalar product for addressing some problems, such as the means, covariance matrix and posterior probability computation. The approach of paper allows two or many parties to cooperatively conduct clustering on their joint data sets without disclosing each partys private data to the other. • In fourth work (chapter 6), we study some parties - each has a private data set - want to conduct the outlier detection on their joint data set, but none of them want to disclose its private data to the other parties. We propose a linear transformation technique to design protocols of secure multivariate outlier detection in both horizontally and vertically distributed data models. While diﬀerent from the most of previous techniques in a privacy preserving fashion for distance-based outliers detection. Our focus is other non-distance based techniques for detecting outliers in statistics. 1.3. Related works Recently a lot of solutions have been proposed for PPDM. Those solutions can be categorized into two main approaches: Secure multiparty computation (SMC) approach and randomization approach. The basic idea of randomization approach is to perturb the original (private) dataset and the result is released for data analysis. The perturbation has to ensure that original individual data values cannot be recovered, while preserving the utility of the data for statistical properties. Thus it allows 7 patterns in the original data to be mined. There are two main perturbation techniques: random transformation and randomization. First transforms each data value (record) into a random value (record) of the same domain with the original data in ways that preserve certain statistics, but hide real values [21, 4, 19, 3]. Second adds noise to data to prevent discovery of the real. Randomization has to ensure that given the distribution of the noise added to the data, and the randomized data set, it can be reconstructed the distribution (but not actual data values) of the data set [1, 36, 16]. The typical example is the algorithm of Agrawal-Srikant [2], which values of an attribute are discretized into intervals and each original value is assigned to an interval. Then, the original data distribution is reconstructed by a Bayesian approach, and based on the reconstructed distribution the decision trees can be induced. Many other distribution reconstruction methods have also been introduced. In [1], Agrawal et al. developed an approach based on Expectation Maximization that also gave a better deﬁnition of privacy, and an improved algorithm. Evmievski et al. [21] used a similar technique for association rules mining. Polat et al. proposed a privacy preserving collaborative ﬁltering method using randomized techniques [53], etc. Although perturbation techniques are very eﬃcient, their use generally involves a tradeoﬀ between privacy and accuracy, if we require the more privacy, the miner loses more accuracy in the data mining results, and viceversa. Even the very technique that allow us to reconstruct distributions also reveal information about the original data values. For example, consider the case of randomizing age attribute. In principle, there are no drivers under 18 in the general distribution of age. Thus, assume that randomization is implemented by adding noise randomly chosen from the range [-10,10]. Although the reconstructed distribution does not show us any true age value, it only give the age of a driver be 40 years old that corresponds to the true age in the range [30 50]. However if an age value in the randomized set is 7, we know that no drivers are under the age of 18, so the driver whose age is given as 7 in the randomized data must be 18 years old in the original data. Thus, some works has been done to measure the privacy of 8 randomization techniques for the purpose that they must be used carefully to obtain the desired privacy. Kargupta et al. [36] formally analyze the privacy of randomization techniques and show that many cases reveal privacy information. Evmievski et al. [20] show how to limit privacy breaches when using the randomization technique for privacy preserving data mining. Many privacy preserving data mining algorithm based on SMC has proposed as well. They can be described as a computational process where a group of parties computes a function based on private inputs, but neither party wants to disclose its own input to any other party. The secure multiparty computation framework was developed by Goldreich[24]. In this framework, multiparty protocols can fall into either the semi-honest model or malicious adversary model. In the semi-honest model, the parties are assumed that follows the protocol rules, but after the execution of the protocol has completed, the parties still try to learn additional information by analyzing the messages they received during the execution of the protocol. In the malicious adversary model, it is assumed that the parties can execute some arbitrary operations to damage to other parties. Thus, the protocol design in this model are much more diﬃcult than in the semi-honest model. However, in current the semi-honest model is usually used for the context of privacy preserving data mining. The formal deﬁnitions of SMC was stated in [24]. The secure multi-party computation problem was ﬁrst proposed by Yao [78] where he gave the method to solve Yao’s Millionaire problem that allows comparing the worth of two millionaires without revealing any privacy information of each people. According to theoretical studies of Goldreich, the general SMC problem can be solved by the circuit evaluation method. However, using this solution is not practical in terms of the eﬃciency. Therefore, ﬁnding eﬃcient problems speciﬁc solutions was seen as an important research direction. In the recent years, many speciﬁc solutions were introduced for the diﬀerent research areas such as information retrieval, computational geometry, statistical analysis, etc. [17, 10, 70]. Randomization approaches [21, 4, 19, 3, 1, 36, 16] can be used in the 9 fully distributed scenario where a data miner wants to obtain classiﬁcation models from the data of large sets of users. These users can simply randomize their data and then submit their randomized data to the miner who can later reconstruct some useful information. However for obtaining strong privacy without loss of accuracy, in [74, 77] the SMC techniques have proposed. The key idea of these techniques is a private frequency computation method that allows a data miner to compute frequencies of values or tuples in the FD setting, while preserving privacy of each user’s data. In Chapter 3, we proposed a SMC solution which allows the miner to learn frequency-based models in 2PFD setting. Note that in this setting, each user may only know some values of the tuple but not all. Therefore, the above mentioned cryptographic approaches can not be used in 2PFD setting. In the FD setting, other solutions based on k-anonymization of user’s data have been proposed in [83, 77]. The advantage of these solutions is that they do not depend on the underlying data mining tasks, because the anonymous data can be used for various data mining tasks without disclosing privacy. However, these solutions are inapplicable in 2PFD setting, because the miner can not link two anonymous parts of one object with each other. The SMC approaches are usually are used for privacy-preserving distributed data mining as well, where data are distributed across several parties. Thus, the privacy property of privacy-preserving distributed data mining algorithms is quantiﬁed by the privacy deﬁnition of SMC, where each party involved in the privacy-preserving distributed protocols is only allowed to learn the desired data mining models without any other information. Generally, each protocol has to be designed for speciﬁc work for the reason of eﬃciency and privacy. Currently speciﬁc privacy-preserving distributed protocols have been proposed to address diﬀerent data mining problems across distributed databases, e.g., in [70, 63, 18], they developed a privacy preserving classiﬁcation protocols from the vertically distributed data based on a secure scalar product method that consists of privacy preserving protocols for learning naive Bayes classiﬁcation, association rules and decision trees. In [34], the privacy preserving naive Bayes lassiﬁcation was addressed for hori10 zontally distributed data by computing the secure sum of all local frequencies of participating parties. Our work in Chapter 4 is to present the frequent itemset mining protocols for vertically partitioned data. Distributed association rules/itemsets mining has been addressed for both vertically partitioned data and horizontally partitioned data [33, 79, 35, 68, 80]. However, to the best of our knowledge, these protocols preserves the privacy of each party and only resist the collusion at most n − 2 corrupted parties. Our protocols for privacy preserving frequent mining involving multiple parties can protect the privacy of each party and gainst the collusion, up to n − 1 corrupted parties. For the related work with privacy preserving distributed clustering. Recently, privacy preserving clustering problems have also been studied by many authors. In [49] and [47], the authors focused on diﬀerent transformation techniques that enable the data owner to share the data with the other party who will cluster it. Clifton and Vaidya proposed a secure multiparty computation of k-means algorithm on vertically partitioned data [66]. In [29], the authors proposed a solution for privacy preserving clustering on horizontally partitioned data, where they primarily focused on hierarchical clustering methods that can both discover clusters of arbitrary shapes and deal with diﬀerent data types. In [59], Kruger et al. proposed a privacy preserving, distributed k-means protocol on horizontally partitioned data that the key step is privacy preserving of cluster means. At each iteration of the algorithm, only means are revealed to parties without other things. But, revealing means might allow the parties to learn some extra information of each other. To our knowledge, there is so far only one secure method for the expectation maximization (EM) mixture model from horizontally distributed sources [40] based on secure sum computation. However, this method requires at least three participating parties. Because the global model is a sum of local models, in case only two parties, which often happens in practice, each party could compute other party’s local model by subtracting its local model from the global model. The aim of this work in chapter 5 is ﬁrstly to develop a more general protocol which allows the number of par11

- Xem thêm -

Tài liệu Distributed solutions in privacy preserving data mining

Tài liệu liên quan

Tài liệu vừa đăng

Tài liệu xem nhiều nhất