о ѵ ч .^
!N ỏ 广
An Improvement Solution for M ultiple
A ttribute Information Searching Based
On Structured P2P Networks
N
g
u
y
e
n
T
h
a
n
h
D
a
t
Faculty of Information Technology
Hanoi University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Doctor Nguyen Hoai Sổn
A thesis s u b m itte d in fu lfillm e n t o f the re q u ire m e n ts fo r th e degree o f
M a ste r o f In fo rm a tio n Technology
Decem ber, 2009
Đ A I H Ọ C QUO C G IA HA NỘI
TRUNG TẨM THÔNG TIN THU VIỀN
A -
L<^
ỗb
______
Table of Contents
Acknowledgements
ii
A b stra ct
1
1
In tr o d u c tio n
2
1.1
Overview and M otivation.....................................................................
2
1.2
Coưim unicat.iọn netw ork m o d e l s ....................................... ..............................
f>
1.3 P2P network m o d e ls ...........................................................................
9
1.3.1
P2P N e tw o rk ...........................................................................
9
1.3.2 P2P Network Models................................................................ 10
1.3.2.1 Unstructured. P2P N e tw o rk ....................................... 10
Hybrid P2P N etw ork ..................................................
Structured P2 P N etw ork ............................................
12
14
1.4 DHT-based P ro to c o l...........................................................................
15
1.3.2.2
1.3.2.3
1.4.1 Distributed Haah Table - DHT
...............................................
15
1.4.2 CHORD Protocol......................................................................
16
1.4.2.1
Topology...................................................................... 17
1.4.2.2
Lookup and I n s e r t .....................................................
1.4.2.3
Join and Leave.......................................................... 20
1.4.2.4
Stabilization and F a ilu re .................
1.5 Summary
2
, .
18
............... 21
............................................................................................ 22
R e la te d W o rk s
2Л
24
INS/Twine :Information distribution based on Attribute-Value trees . 24
2.1.1 S o lu tio n .................................................................................... 24
2.1.2 System architecture................................................................... 26
2.1.3 System architecture................................................................... 26
2.1.4 S u m m a ry ................................................................................. 27
iii
—
2.2
_____ TABLE OF CONTENTS
CDS: Irifonnation D istribution Based O il Load Balancing M a trix
2.2.1
. . 28
S o lu tio n ....................................................................................28
2.2.2 System architccturc.................................................................. 28
2.2.2.1
Registering a content n a m e ...........................................29
2.2.3 System architecture...................................................................29
2.2.4
2.2.3.1
Query re solutio n....................................................... 30
2.2.3.2
S u m m a ry .......................................................................... 30
Load Balancing M a trix (LB M )
2.2.4.1
2.2.5
2.3
The structure of L B M ...............................................31
System architecture.................................................................. 31
2.2.5 .1
2.2.6
.................................................... 30
LB M management mechanism....................................... 32
S u m m a ry ................................................................................. 33
Data In d e x in g .......................................................................................
2.3.1
33
S o lu tio n ............................................................................................. 33
2.3/2 Insert a f i l o .............................................................................. 36
2.3.3 Lookups.................................................................................... 37
2.3.4 S u m m a ry ................................................................................. 38
2
ậ.4
SMAV: Searching - M ultiple-attribu te V a lu e ..........................................38
2.4.1
S o lu tio n .................................................................................... 38
2.4.2 Distribution of information c o n te n t......................................... 40
2.4.3 Information content name query...............................................42
2.4.4 S u m m a ry ................................................................................. 42
A n Im p ro v e m e n t S o lu tio n fo r M u ltip le - a ttr ib u te In fo r m a tio n Search
in g on S tru c tu re d P 2 P N e tw o rk
3.1
44
I d e a .................................................................................................................45
3.2 Three Levels MappingModel
3.2.1
Overview
.............................................................. 46
...................................................................... ..
3.2.2 Thrce-Levclb Sub-key Mapping S to rin g ................................... 48
3.2.3 Distribution of information c o n te n t......................................... 50
3/2.4
Inform ation q u e ry ............................................................................. 54
3.2.5
S u m m a r y .......................................................................................... 57
3.3 The Dynamic ThresholdV alue s............................................................57
3.3.1 A formula of threshold v a lu e s .................................................. 57
3.3.2 Adjusted Distribution A lg o rith m ............................................ 59
3.3.3 Updating Threshold Value Periodically....................................61
46
TABLE OF CON TENTS
3.1
4
_____
____
_________
v
3.3.4 Adjusted Lookup A lg o rith m ..................................................... 62
Summary ............................................................................................. 64
S im u la tio n s and E v a lu a tio n s
65
4 .1
Q ualitative
65
4.2
Simulation D e s c rip tio n ................................................................................ 67
4.3
Evaluation Based On S im u la tio n s ..............................................................68
E valuations.................................................................
4.3.1 Load balancing......................................................................... 68
4.3.2 Distribution of informationc o n te n t........................................... 71
4.3.3 Routing Performance........................................................ ..
5 Conclusions and Future W ork
74
77
List of Figures
1.1
Client/Server network m o d e l......................................................................
8
1.2
Peer-To-Peer network model with 5 peers...........................................
10
1.3
Locating resources in a Gnutella-likc P 2P e n v iro n m e n t...................... 11
1.4
An example of H yb rid P2P M o d e l............................................................
13
L5
15
1.6
Distribution data progress based on D H T ...........................................
Chord's key space with 23 points . . ........... .....................................
17
1.7
Chord's com ponents......................... .. ..............................................
17
1.8
Lookup progress of Chord*s protocol with key 5 4 ............................... 19
1.9
Joining phase of a node in C H O R D protocol.................. ......................... 21
2.J
Meta string and A V T rc o .....................................................................25
2.2
Architecture of IN S /T w in S y s te m ............................................................ 26
2.3
S p litting a rcbourcc description into s t r a n d s ......................................... 27
2.4
The architecture of CDS system ......................................................... 29
2.5
An example of distribution of AVs in nodes........................................30
2.6
The structure of Load Balancing Matrix for { 이,
V i } ......................... 31
2.7
An example of described d a t a ...................................................................34
2.8
Sample File Queries .............................................................................34
2.9
Mappings between q u e rie s ..................................................................35
2.10 Query mapping for three descriptors
................................................ 36
2.11 An example of mappings t r e e ............................................................36
2.12 An example of a path of queries......................................................... 37
2.13 Key - sub-key m a p p in g s.....................................................................40
3.1
The number of hop levels of a content name in pure S M A V .............
3.2
Mappings are created fr 이ฑ key k ị ............................................................ 49
3.3
The generation of distributed keys from a content n a m e ......................53
3.4
Block diagram of query progress of improved S M A V ......................... 55
49
LIS T OF FIG U R E S
v ii
3.5
3.6
An example of query progress with 4 common k e y s ......................... 56
Combining of common keys and a uncommon k e y ............................ 61
4.1
The distribu tion of ЛѴ pairs in content n a m e s ...................................... 68
4.2
Number of inform ation contents stored in each of 5000 n o d e s ............ 69
4.3
The number o f queries is processeci by each of 5000 nodes.................. 70
4.4
4.5
Load balancing among nodes...............................................................70
Mappings stored in every node............................................................ 72
4.Г)
Mappings is created by CNs in a. DSMAV b. S M A V .......................72
4.7
4.8
4.9
The number of keys stored in every n o d e ...........................................73
Level-к Sub-keys are created by three solutions ............................... 73
Logicai hop count required for each query ........................................ 74
4.10 The maximum number of hop level of three s o lu tio n s ...................... 75
4.11 The number of successful queries ...................................................... 75
List of Tables
1Л
Comparison: Client/Server vs. P 2 P .........................................................
9
1.2
Definition of variables for node ท. using m -bit identifiers
..................
18
2.1
Mapping tabic between distributed key and content n a m e s.............. 41
2.2
Mapping table between distributed key and s u b -k e y s ....................... 41
2.3
Mapping table between distributed key and uncommon k e y s ........... 42
3.1
Mapping Table between distributed key and content n a m e s .................51
3.2
Mapping table between level-2 sub-keys and distributed k e y s ............ 53
3.3
Mapping Table between keys and co n te n ts........................................60
viii
Abstract
Conventional information searching engines such as Google, Yahoo, and Wikipedia...
support only Keyword-based searching on websites. They cannot search information
in various kinds of resources such as personal devices like Laptop, PDA} Cell Phone
or sharing files in P2P Network.
Besides, DHT-based P2P networks such as Chord, CAN, Pastry can achieve
cxact (!ucry (i.e. query of an exact key) with characteristic of scalability, efficicncy
and fault-tolerate. However, in the Cítóe of complex queries such as range query or
multiple-attribute query, pure DHT is not efficient since lots of query messages must
be sent.
In this thesis, we focus our intentions on m ultiple-attribute query on DHTbatícd P2P network. The big problem here is the unbalance among nodes due to the
appearance of common attribute/value pairs (AV pairs) in content names. The main
idea of our method is to lim it number of content items, which assigned to an ID by
creating sub-IDs from multiple AV pairs if those AV pairs appear in lots of content
names, to threshold value of each node. To reduce query cost, our system also keeps
the mapping between an ID and its sub IDs if existed in the node responsible for the
ID. Moreover, we store only mappings, which are created in distribution progress,
to 2 nodes. Our method can achieve both efficiency and a good degree of load
balancing even when the distribution of AV pairs is skewed. Our simulation result
shows the efficiency of our solution in respects of lookup time and the degree of load
balancing
1
Chapter 1
Introduction
1.1
Overview and Motivation
With the unprecedented growth of information technology, today wecan see that
information is appearing in everywhere. Information might be found in various kinds
of resource« «uch as personal dcvices like Laptops, PDAs, Cell phones..., websites in
the Internet, sharing files in P2P network 1.,.
From the explosion of information, there are more and more information search
ing demands in somewhere. Every day we need lots of information to communicate
and work efficiently and easily. For instance, we search for weather forecast information before a trip or a picnic. We also search for information of the latast news of
the day,refercncc« of a product to buy, information of market priccs, etc... In lots of
cases, if we seize desired information quickly and exactly, we might have more suc
cessful opportunities in communication and work. Therefore,information searching
is a necessary demand in nowadays information age. The emergence of new applica
tions and services will require an efficiency information searching system which can
realize complex query on contcnt names in a sealable manner (พ . Adjie-Winoto &
Liliey. 1999;Carzaniga Sz Wolf, 2001; Foster & Tuecke,2002).
There are many large systems to allow searching information such as conventional
search engines: Google, Yahoo. Amazon, eBay, Wikipedia... Google engine allows
users to search information based on keywords on Internet. This engine can link to
billions of websites to search information. Information of each website is described
by keywords and then they are processed and stored in servers of Google.
Conventional search systems often use Client/Server model where servers pro
9
1.1. Overview and M otivation
vide searching services to clients. However, Client/Server model have some disad
vantages. Firstly, it has limitation in scalability. Servers are made with high cost
because it need a very big capacity of processing and storing. Secondly, each server
may be a single point of failure. When server goes down,operations will be ceascd.
Moreover, as the big number of simultaneous client requests to a given server increaiies. the server can become overloaded. When a big amount of clients join to the
network, traffic congestion on the network has also been an issue.
Rcccntly, the appcarancc of Pccr-to-Pccr (P2P) network model has attracted the
interest of lots of people, P2P with their decentralized control, self-organization and
adaptation have emerged as a significant social and technical phenomenon over the
last year. Unlike Client/Server model, P2P networks aim to aggregate largo num
bers of computers that join and leave the network frequently. In pure P2P systems,
individual computers communicate dircctly with each other and bharc information
and resources without using dedicated servers. For example, they provide infrastruc
ture for communities that share CPU cycles (e.g., SETI@Home, Entropia) and/or
storage spacc (e.g., Napster (Idit Kcidar, 2006; Napster, 1999). FrccNet, Gnutella
(Gnutella, 1999)). or that support collaborative environments (Groove).
In P2P networks, all clients provide resources, including bandwidth, storage
spac(î, and computing power. If there are more and more many nodes to join to
the svstem, the total capacity of the systcn would be more and more increase. This
is not true of Client/Server network model with a fixed set of servers, in which
adding more clients could mean slower data transfer for all users. The distributed
nature of P2P networks also increases robustness in case of failures by replicating
data over multiple peers, and by enabling peers to find the data without relying on
a centralized index server. In the latter ease, there i« 110 single point of failure ill
the system.
Information searching on P2P network is attended in recent years. Advantages
of P2P network model allows us to construct information searching systems with
capabilities of scalability and fault-toleratc. Bccausc of the whole of data of system
are distributed to all nodes; each node is responsible for a portion of data and to
take part, in search progrevss. The Gnutella network (Gnutella, 1999) supports to
share and search file«. It searches data by flooding messages to the whole network.
Nevertheless? Gnutella network requires high overhead; the search may be failed
because a query may be not routed to the node is responsible for desired information.
Hence, it leads to search information inefficiently. eDonkey (Weikum, 2002) network
C h ap ter 1. Introduction
suits to share and scarch big files such video files, fu ll music albuma and software
based on some nodes (which are called indexing servers) that allow users to locate
files within the network. However, maintaining indexing servers cause ฯcentral point
of failure” • If indexing servers ccasc operates, a query may be not sent to the nodes
whose data are indexed in these servers. In general, scalability, efficiency and faulttolerance in message routing and inform ation query on P2P networks are essential
problems th a t must be resolved.
The adoption of D istributed Hash Table (D H T ) such as C AN (ร. Ratnaäamv
к Karp,
2001),
Chord (Stoica et al..
2001).
Pastry (A. Rowstroib
2001),etc
offers
a promising solution for routing problems in P2P network. A DHT-based network
constructs a structure of a v irtu a l key spacc where cach node is responsible for a
portion of key space7 and keys are used as destinations to route messages. Ill the
case of Chord, it can route a message to a node responsible for a destination kcv in
O(logN) hops while each node only needs to maintain O(logN) links to other nodes,
where N is the to ta l of nodes in the network. Some typical DHT-based network are
BitTorrcnt, the Storm botnct. the Kad network, YaCy. W ith above organization.
DHT-hasod network meet, a problem. I t only achieve exact queries. In DHT-based
network, inform ation contcnt. is represented by a key from its description. I f users
want to scarch this inform ation content, they need to send w ith complete query to
the network. In fact, if users only remember a portion of inform ation content, they
would be not able to search this inform ation contcnt. In the сабе of complex queries
such as range query or m ultiple-attribu te query, pure D H T is not efficient since a
lot of query messages must be sent.
In this thesis, we focus our intentions on DHT-based P2P networks and the
realization of m u ltiple-attribu te searching system based on DHT. Simple DHT-ba*scd
distribution of inform ation contents can be used to realize exact query (i.e. querying
information contents which have the same content name as the name in the query)
efficiently. However,it is not efficient to perform complex queries such as multiplea ttribute queries. It is because the number of inform ation contents th a t match a
complcx query is often large and therefore a lot of query messages must be sent to
resolve a query. To realize multiplc-attribute queries efficiently on DHT-based P2P
network, a framework b u ilt above the D H T is necessary.
Several papers (Gao, 2004; Y- Arakawa et al.,2005; M. Balazinska,2002; GarcesErice & Ross, 2004) have proposed solutions for above problems. In these solutions,
content names are represented by attribute and value (AV) pairs. In (Gao, 2Q04;
.1. Overview and M otivation
Y. Arakawa et al.. 2005),cach AV pair of a content name iỉ> hashed to a key.
In (M, Balazinska,2002; Garces-Erice & Ross, 2004), a content name forms an
attributed/value tree and each strand of an AV tree is matched to a key. Information
content is then distributed to cach node rcbponsiblc for the key based on DH 丁
routing algorithm. Information query is realized by looking up the nodes that are
responsible for keys created from a query AV pair. However, the distribution of
information content ba«ed on each AV pair or each strand of AV tree causes the
load unbalancing since there may be common AV pairs those appear in content
names w ith high probability. There arc several solutions for this problem such as
using m ultiple nodes in responsible for a key (Gao. 2004) or lim itin g number of
content item« corresponding to a key in a node (Y. Arakawa et al., 2005). However,
these approaches have not good balance between load balancing and query efficiency
though there is a tradeoff between load balancing and query efficiency.
In this thesis we propose a system callcd DSMAV (using Dynamic threshold
values for Searching Multiple Attribute/Value) to distribute and query information
content ba^ed on AV pairs on DHT-bai>ed network. The m ain idea of our system is to
lim it the number of inform ation contents, which are distributed by a distributed key.
to a reasonable threshold value. It is done by creating sub-keys from multiple AV
pairs in a contcnt name if these AV pairs already appear in a lot of contcnt names.
If the tota l number of inform ation contents of a sub-key is not over a threshold
value^ it would be used as a distributed key to distribute its information content.
Our system also keeps the mappings between sub-keys created from two common
AV pairs and distributed keys created from more two common AV pairs to reduce
query cost. O ur system satibfics three following requirements:
• Efficiency Distribution :A content name will be distributed to a few of nodes.
The number of the mappings and keys are distributed to every node reasonably.
Mappings of level-2 sub-key and level-m sub-key (m I 2) are stored in the nodes
th a t take responsibility for level-2 sub-keys.
• Efficiency Searching: If a query name contains an uncommon AV pair, it will
be resolved by querying only node which is responsible for the key created by
the uncommon AV pair. I f a query name contains only common AV pairs,
query messages w ill be sent quickly to nodes responsible for prim ary key and
level-2 sub-key. The node, which is responsible for level-2 sub-key, also use
mappings between level-2 sul>key and other sub-keys to route query messages
6
C h a p te r 1. Introduction
to other nodes, which store queried contents.
• Load balancing: The number of information contents those are corresponding
to a distributed key is not over a threshold value. Hence, even if the distrib ilio n of AV pairs in content names is skewed, the number of inform ation
contents stored in cach node can be kept balanced naturally.
O ur solution uses DHT-baised protocols to route messages.
The keys, which
are created by one or more AV pairs in the system, are routed by DHT-based
protocols such as CHORD. However, the inform ation contents, which include a set
of attribute-value pairs, arc distributed to the node« effectively.
An inform ation
content, which contains common AV pairs, is distributed to a set of nodes by using
keys, which are hashed from these common AV pairs. Sim ilarly, a query process
also hash one or more AV pairs of a query name into keys, which are sent to the
nodes take responsibility for the keys. Returned result would be a list of inform ation
contents that are matchcd w ith AV pairs of the query name.
O ur system can achievc both cfficicncy and a good degree of load balance even
when the distribution of AV pairs ill content names is skewed. O ur sim ulation shows
the effectiveness of our proposed system.
Our thesis І8 organized in 5 chapters. In first Chapter^ we would introduce the
overview of P2P network and P2P network models. W ith Chapter Two. we describe
conventional solutions for M ultiple-attributes inform ation searching aỉá our related
works. Then, we proposed a solution for M ultiple-attribu ted inform ation searching
based on dynamic threshold values on Structured P2P Network in Chapter Three.
Baling on our proposed solution, in Chapter Four wc implement simulations and
evaluate the solution qualitatively and quantitatively. The last of all, we summarize
carried works and talk about trends of development in the future in Chapter Five.
1.2
Communication network m odels
W ith the rising up of requirem ent of searching inform ation everywhere» there are
more and more people, who might want to share, search or download data from
network environment. Today,network environments allow users to create and share
information resources together Besides, the appearance of Internet, where contains
frequently updated inform ation, supports user's searching inform ation whenever and
1.2. Com m unication netw ork models
7
whenever they need. Users can search latest news, inform ation of a book, reviews
for business products, or download resource fr 이n on Internet easily. To do this,
some inform ation searching systems al lo พร users to search data from everywhere by
ubiquitous com puting models.
I biquitous com puting is a post-desktop model of human-computer interaction in
which inform ation processing has been thoroughly integrated into everyday objects
and activities. In the progress of ordinary activities, som eone,
•using” ubiquitous
computing engages many com putational devices and systems simultaneously, and
may not necessarily even be aware th a t they are doing so. For this reason, ubiquitous
computing model is used in many inform ation searching systems.
In these systems, the term MDistributed information” is described as lots of
inform ation resources th a t can connect together based on a communication network
environment. They contain devices, which store inform ation, such as P D A ’S,Mobile
phone. Laptops… Their data may be pieces of information such as the information
of a laptop or a book, or data files such as PDF, M P 3 ,Video files.... To support
users searching the inform ation flexibly, the systems use characteristics of an object
to represent inform ation. In other hand, each object may be described by M ultipleattributes based on its characteristic.
For instance, a laptop’s attributes can be
described by its characteristics as name of manufacturer, CPU, R A M , HDD... I f an
object is represented by more and more attributes, it would be more and more found
easily. For this reason,an information searching system w ill store lots of data, which
are created by objects’ representations. It hence needs to analysis and organize data.
Reasonably choosing of a communication network model and implementing a well
protocol would increase performance and effective of the system.
There are two typical communication network models, which a M ultiple-attribu te
inform ation searching system may be able to use, namely C licnt/S crvcr and P2P
network. Client/Server network model (Figure 1.1) is used widely and popularly
everywhere. This model is a distributed application architecture that partitions
tasks or workloads between service providers (servers) and service requesters, called
clients.
Often clients and servers operate over a computer network on separate
hardware. A server machine ib a high-pcrformance host that is running one or more
server programs which share its resources w ith clients. A client does not share any
of its resources, but requests a server’s content or service function. Clients therefore
in itia te com munication sessions w ith servers which await incoming requests.
W ith such architecture, C lient/Server network model has some advantages and
C hapter 1. Introduction
CUENT ff
lì
广 ÍỄÊ3
m ะJ
i
î
Figure
1 . 1 :C lient/Server
network model
disadvantages. In most cases, it allows the roles and responsibilities of a computing
system to be distributed among several independent computers th a t are known to
each other only through a network. This creates an additional advantage to this
architecture :ease of maintenance. For example, it is possible to replace, repair,
upgrade, or even relocate a server while its clients remain both unaware and un
affected by th a t change. A ll data are stored on the server, which generally have
better security controls than most clients. Servers can control access and resources,
to guarantee th a t only those clients w ith the appropriate permissions may access
and change data. Since data storage is centralized^ updating data are administered
easily. It functions w ith m ultiple different clients of different capabilities. However
C lient/Scrvcr model may be not reasonable to constructs distributed inform ation
searching systems because its disadvantages. Traffic congestion on the network has
been an issue since the inception of the Client/Server network model. The server
may become overloaded if it processes a lot of requests from the big number of simultaneous clients. I t may be hard to achieve scalability w ith a single point of failure.
In the case the server is damaged, clients' requests cannot be fu lfille d and the system
w ill cease. Moreover, activities of server require adm inistrators who are knowledge
able experts about network and system. This increases cost of management and
operation.
In contrast w ith the C licnt/S ervcr network model 1 P2P network model, where
it aggregated bandwidth actually increases as nodes are added, since the P2P net
work's overall bandwidth can be roughly computed
the sum of the bandwidths of
every node in that network. It is very suitable for distributed network architecture
1.3. P 2P n e tw o rk m o d e ls
9
Table 1.1:Comparison :Client/Server vs. P2P
Client/Server
P2P
A s y m m e tric : client and servers carry out Sym m etric: each node carry out the ьаше
difFcrcnt tasks
tasks; No distinction between Client and Servei
G lo b a l knowledge: servers have a global Local knowledge: nodes only know a small
set of other nodes
view of the network
D ecentralization: no global knowledge,
C entralization: Communications and
management arc centralized
only local interactions
Robustness: several nodes mav fail with
Single p o in t o f fa ilu re : a server failure
little or no im pact
brings down the system
H igh scalability: High aggregate capacity,
L im ite d scalability: servers easily
load d istribu tion
overloaded
E xp ensive: server storage and bandwidth Low-cost: storage bandwidth is contributed
by users.
capacity is not cheap
with composing of participants that make a portion of th e ir resources (disk storage,
network bandwidth or processing cycles). Peers are both suppliers and consumers of
resources, in contrast to the traditional Client-Server model where only servers sup
ply. and clients consumci. Organization of P2P network allows overcoming problems
of Client/Server network. Table 1,1 shows comparison of C licnt/Servcr and P2P
model. In next section, we describe P2P network architecture w ith its advantages
and disadvantages.
1.3
P2P network models
1.3.1
P 2 P N etw ork
Contrast to client-server networks, where network inform ation is stored Oïl a cen
tralized file server PC and made available to tens, hundreds, or thousands of client
PCs, the inform ation stored across P2P networks ib uniquely decentralized. A P2P
network model allows ใฑany PCs to pool their resources together. Individual re
sources like Desktops. Laptops, PDAs, and storage devices... are transformed into
shared, collective resources th a t arc accessible from every PC.
Because P2P PCs have their own hard disk drives th a t are accessible by all com
puters. each PC acts as both a client (information requestor) and a server (informa-
10
C h ap ter 1. Introduction
tion provider).
Ill
Figure 1,2, five P2P workstations arc shown. A ll five computers
can conimunicatc directly with each other and share one another.s resources.
Figure 1.2: Peer-To-Peer network model with 5 peers
A P2P system is a distributed collection of peer nodes. It may provide services
to other peers and consume services from other peers. Data of each node may
be a portion of common data of system. It means that common data of systeixi
is distributed to whole of possible nodes. Each node stores a portion of common
data and related services. Therefore, the load of storage and processing data is also
divided into several parts, which correspond with peer nodes. Each node stores and
processes a small number of pieces data, so if some node fail, the network system І8
almost not affcctcd seriously.
As a completely decentraiized model, it allows the development of applications
with :high-availability. fault-tolerance, and scalability such as Sharing of content ap
plication (Fire sharing, content delivery network, Gnutella. eMule, Akamay), Shar
ing of storage (Distributed file system), Sharing of CPU time (Parallel computing,
Grid)
There are three models of P2P network consisting of Unstructured, Hybrid and
Structured P2P network. Each model has individual advantages and disadvantages.
1.3.2
P 2 P N etw ork M odels
1 .3 .2 .1
U nstru ctu red P 2 P N etw o rk
Unstructured P2P network is an unregulated overlay where each participating node
connects to others rand이nly and arbitrarily, act as equals and merge the roles of
1.3. P 2P netw ork models
11
clients and server. Unstructured P2P network has no central server managing the
network, neither is there a central router. Joining of a new node based on a wellknown node that, is called a bootstrap node. The bootstrap node is flexible in the
now node's neighbor selection and routing mcchanisms. Unstructured P2P network
has profound impact on the efficiency of search. The typical Unstructured P2P
Network system is Gnutella (Gnutella;1999) (Figure 1.3).
For the purpose of files sharing. Gnutella system is constructed basing on Un
structured P2P model. Peer nodes of the svstem are organized randomly in Unstruc
tured overlay. Data of the system consist of files such as mp3, image, and video,
which each of whole peer nodes may share to other nodes. A new node, who wants
to join ill the system, would perform an operation ” Join” .
Firstly, it would contact with a bootstrap node. The bootstrap node would
send to back a li«t of existed node, which is choscn randomly. Then, the new node
would store and relate ship to the list of the existed node as its neighbors. From the
neighbor nodes, the new node may reach other nodes in the communication network.
Thence, the new node continue to get more addresses of other existed nodes from
the nodes it can reach, and so on. Finally, the new node is existed as peer node
of the system. Figure 1.3 shows an example of locating resources in a Gnutella-like
P2P environment.
«
เ๏๏
n
d
s
c
ь.ч>кні
Figure 1.3: Locating resources in a Gnutella-like P2P environment
In Unstructured P2P Networks, the search progress is unstructured, result of a
12
C hapter 1. Introduction
query may be not found. (Figure 1.3). Popular data is usually found easily because
of they are stored by many nodes. The number of content of popular data is enough
to locate a searched content by flooding the network.
Using flooding allows the
search to be performed easily and reliably in a highly connected overlay.
However. Hooding has some disadvantages. Firstly. It creates a lots of duplicate
query messages to send to a given node from its many neighbors.
Secondly;it
is diliicult to determine the appropriate T im e -to L iv e (T T L ) which controls the
flooding progress. A high T T L allows achieving high scarch re liability but requires
high overhead. Characteristics of the overlay affect the flooding effectiveness versus
the overhead. Otherwise, a lim ita tio n of йсаіе-free topologies is the high load on
very few number of hub nodes. Peers are not w illing to m aintain high loads as they
may not. want to store large number of entries for construction of overlay topology.
1.3.2.2
H y b rid P 2 P N e tw o rk
In Unstructured P 2P Network model, queries m ight not always be resolved. A l
though popular data m ight be stored at several peers, but if a peer node look up
data shared by only a few other peers, then it is highly likely th a t search w ill be not
successful. Because of there is no correlation between a peer and the data, which
is managed by it. There is no guarantee th a t flooding w ill find a peer th a t has the
dc\รircd data. Flooding also causes a high amount of signaling traffic in the whole
network and hence such networks have very poor search efficiency.
W hile the structure of the hybrid P2P network model m ight tacklc problems of
routing and lookup in U nstructured P2P Network. In that, peer nodes are divided
into two types, data nodcb and hub nodes. Data node stores real data and informa
tion of a hub node. Hub node is the same as a ,1server Î it do not store real data.
I t store only indexes to files in the network. To jo in in the systemโ each node has
to contact w ith a server and to provide real data,which it wants to share to others.
Since, t he server would create indexes for these files, and then it stores these indexes
in its database.
Figure 1.4 shows an example of H yb rid P2P Model. Servers are entrusted w ith
routing tasks. Each server store indexes to files and inform ation of data node and
other servers. Each data node would store data files and information of a server.
Data files are distributed among participating data nodes.
However, to look up
data, each node needs to use routing inform ation th a t is stored centrally in servers.
1.3. P 2 P netw ork m odels
13
Figuře 1.4: An example of Hybrid P2P Model
Figure 1.4 also shows an example of lookup a file in H yb rid P2P Model. Searching
node would send query message to its server. The server looks up filcs^ locations
in the list of indexes ill its database. It also sends query requests to other servers
simultaneously.
Those servers would search for indexes corresponding to desired
files and return the files* locations to the searching node.
Finally, the searching
nodo would contact the data nodes, which contain desired files, and download real
files dircctly. In this example, node A would connect to node в and then to download
data files from node в directly after getting location inform ation from servers.
Being a decentralized network model,eDonkey network is typical H ybrid P2P
Model. It best suited to share big files among users. It. allows sharing video files,
full music albums and softwares. There is no central hub for the network. Data file«
are distributed among peer nodes. eDonkey servers act as communication hubs for
the clients, allowing users to locate files w ith in the network.
Like Unstructured P2 P Network model. Hybrid P2P network is implemented
easily becausc of it does not need to implement distribution and routing algorithms‘
Furthermore, because of servers undertake lookup the locations of data files, Hy
brid model does not use query flooding or random searching like Unstructured P2P
model, since it is more efficient. However, m aintaining indexing server cause Mcen
tra l point of fa ilu re ř. If a server stops providing service, the data, which indexes are
created by them ill the server^ database, would not exist in the system. Therefore,
the whole system would cease if all servers stop operating.
- Xem thêm -