INDEXING A PROTEIN STRUCTURE DATABASE
Phan Mạnh Thường1, Lâm Thị Hoà Bình1, Đặng Như Toàn1, Đoàn Thiện Minh1, Trần Văn Lăng2
1 Faculty of Information Technology, Lạc Hồng University
10 Huỳnh Văn Nghệ, Biên Hòa, Đồng Nai
{thuong,binh,dangnhutoan,dtminh}@lhu.edu.vn
2 Vietnam Academy of Science and Technology
Mạc Đĩnh Chi, Ho Chi Minh City
[email protected]
Abstract. Searching for tertiary-structure similarity among proteins in a large protein structure database is a complex and time-consuming problem. The number of known protein structures is growing rapidly, and indexing the proteins in a structure database makes structure search and comparison faster and more efficient. This paper presents a method for indexing a protein structure database: each structure is analyzed to extract feature vectors, and a tree built on those feature vectors serves as the index. Once the database is indexed, searching for a protein structure, or for a substructure within a protein, becomes faster and more accurate.
Keywords: protein tertiary structure, indexing, protein database.
1. Problem Statement
A protein is a polypeptide chain built from amino acids. Studying proteins is important because they take part in every biological process, including enzymatic catalysis (all chemical reactions in living cells are catalyzed by protein enzymes), the transport of substances such as oxygen and ions, and signaling. To understand the relationship between protein structure and function, researchers need to retrieve structures from protein structure databases and classify them into protein families. Grouping proteins by structural similarity serves the following goals:
o Detecting evolutionary relationships
o Identifying motifs (repeated segments), which are structures formed by the arrangement of amino acids in three-dimensional space
o Discovering relationships between protein structure and function
o Supporting drug design
o Detecting sequences associated with cancer and other diseases.
Thanks to technological advances and the rapid development of structure determination methods such as X-ray crystallography and NMR spectroscopy, the three-dimensional structures of a large number of new protein molecules have been determined. These structures are stored in several databases on the Internet and are freely available to researchers, including:
o The Protein Data Bank, PDB [1], maintained by RCSB (Research Collaboratory for Structural Bioinformatics): 73153 structures
o SCOP, Structural Classification of Proteins [2]: 38221 structures
o CATH Protein Structure Classification [3]: 104238 structures
o ModBase, Database of Comparative Protein Structure Models (Sali Lab, UCSF): 41140 structures
Searching an ever-growing protein structure database for tertiary structures similar to a given protein, or to a substructure of a protein, is difficult and time-consuming. Biologists therefore need a tool that searches protein structure databases as quickly and effectively as BLAST [5] searches sequence databases. Protein search and classification usually proceeds in two stages: extracting features that describe each protein, and measuring the similarity between the features of different proteins in order to classify them.
Many algorithms have been proposed for extracting features from protein structures. CTSS [6] approximates the Cα backbone of a protein with a smooth spline of minimum curvature, and then stores the curvature, the torsion angle and the secondary structure of each Cα atom in a hash-based index.
ProGreSS [5] is a newer method that extracts features from both structure and sequence by sliding a window along the protein backbone. Its structural features are similar to those of CTSS (curvature, torsion angle and secondary-structure information), while its sequence features are computed with scoring matrices such as PAM or BLOSUM. As with CTSS, the features extracted by ProGreSS are not local features.
PSIST [7] is one of the more effective algorithms because of its relatively high accuracy. Its approach is to convert the local structural information of a protein into a "sequence" and to build a suffix tree over the set of such "sequences" to support search. Compared with extracting local features from a single amino acid, the sliding-window extraction used by PSIST is better because each feature vector carries both translational and rotational information. After the feature vectors are normalized, the protein structure is converted into a string of discretized symbols, called the structure-feature sequence.
However, searching a suffix tree is not particularly fast. PSISA [8] extracts feature vectors in the same way as PSIST, but indexes them with a suffix array instead of a suffix tree in order to speed up search. The experimental results reported for PSISA show that suffix-array indexing speeds up search but at the same time increases memory usage relative to the suffix tree used in PSIST.
This paper presents a method for indexing a protein structure database. It reuses the PSIST procedure to extract feature vectors, and from the set of feature vectors it builds an index tree by merging the branches of the feature-vector sequences. This tree both limits memory usage and allows search over the whole space of structures from different protein families, which makes searching for a protein structure, or for a substructure within a protein, faster and more accurate.
The rest of the paper is organized as follows. Section 2 presents the indexing method for protein structure data: feature vector extraction, feature vector normalization and construction of the index tree. Section 3 reports experiments on protein structure data sources and on querying that data. The final section gives an evaluation and conclusions.
2. Indexing Protein Structure Data
a) Feature vector extraction
Each protein is an ordered chain of amino acids (residues) joined by peptide bonds. Each residue contains one Cα atom together with N and C atoms. Bond lengths, bond angles and torsion angles completely determine the conformation and geometry of a protein.
A bond length is the distance between two bonded atoms, measured in angstroms (Å), and a bond angle is the angle between two covalent bonds of the same atom. For example, the N-C bond length is 1.33 Å, and the bond angle between Cα-N and N-C is 122°.
Figure 1. Bond lengths and bond angles between atoms.
Torsion angles describe structures that can rotate about their bonds. Suppose four atoms are connected by three bonds B_{i-1}, B_i and B_{i+1}. The torsion angle of bond B_i is defined as the smallest angle between the projections of B_{i-1} and B_{i+1} onto the plane perpendicular to B_i.
Figure 2. The torsion angles φ, ψ and ω between atoms.
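The projection-based definition of the torsion angle can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `torsion_angle`, the tuple representation of atom coordinates and the sign convention of the returned angle are assumptions, not part of the paper.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion_angle(p0, p1, p2, p3):
    """Torsion angle (degrees) about the bond p1-p2 for four bonded atoms."""
    b_prev = _sub(p0, p1)   # B_{i-1}, pointing away from the central bond
    b_mid = _sub(p2, p1)    # B_i, the central bond
    b_next = _sub(p3, p2)   # B_{i+1}
    length = math.sqrt(_dot(b_mid, b_mid))
    axis = tuple(c / length for c in b_mid)
    # Project B_{i-1} and B_{i+1} onto the plane perpendicular to B_i.
    v = _sub(b_prev, tuple(_dot(b_prev, axis) * c for c in axis))
    w = _sub(b_next, tuple(_dot(b_next, axis) * c for c in axis))
    # Signed angle between the two projections.
    x = _dot(v, w)
    y = _dot(_cross(axis, v), w)
    return math.degrees(math.atan2(y, x))
```

For four atoms lying in one plane the function returns 0° when the outer atoms are on the same side of the central bond and ±180° when they are on opposite sides.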
To capture local features more accurately, features must be extracted from a set of neighboring residues. To build a local feature vector, each residue is first described individually, and then the relationship between a pair of residues, and among a set of residues, is determined. For every residue, the Cα-N bond length is 1.46 Å, the Cα-C bond length is 1.51 Å, and the angle between Cα-N and Cα-C is 116°. Consequently, the triangles formed by the N-Cα-C atoms of all residues are congruent, and each residue can be represented by a triangle.
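The distance d between a pair of residues is defined as the Euclidean distance between their Cα atoms. With (x_i, y_i, z_i) denoting the Cα coordinates of residue p_i, the distance between two residues is
\[ d(p_i, p_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2} \tag{1} \]
The angle θ between a pair of residues is defined as the angle between the two planes formed by the three atoms N-Cα-C of each residue.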
Figure 3. Distance and angle between two residues.
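Both the distance and the angle are invariant under translation and rotation of the protein. The Euclidean distance between two Cα atoms is computed directly from their three-dimensional coordinates. The angle between the two planes formed by the N-Cα-C atom triples is computed as the angle between the two normal vectors anchored at the Cα atom of each plane. The normal vector is given by equation (2):
\[ \vec{n} = \overrightarrow{C_\alpha N} \times \overrightarrow{C_\alpha C} \tag{2} \]
The angle between two normal vectors \vec{n}_1 and \vec{n}_2 is given by equation (3):
\[ \cos\theta = \frac{\vec{n}_1 \cdot \vec{n}_2}{\lVert \vec{n}_1 \rVert \, \lVert \vec{n}_2 \rVert} \tag{3} \]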
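The residue-pair distance and angle described by equations (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions: a residue is represented as a dictionary with the keys "N", "CA" and "C" holding (x, y, z) tuples, the normal vector is taken as the cross product of the Cα→N and Cα→C vectors, and the function names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def ca_distance(res_i, res_j):
    """Euclidean distance between the C-alpha atoms of two residues."""
    return _norm(_sub(res_i["CA"], res_j["CA"]))

def plane_normal(res):
    """Normal of the N-CA-C plane, anchored at the C-alpha atom."""
    return _cross(_sub(res["N"], res["CA"]), _sub(res["C"], res["CA"]))

def cos_theta(res_i, res_j):
    """Cosine of the angle between the N-CA-C planes of two residues."""
    n1, n2 = plane_normal(res_i), plane_normal(res_j)
    return _dot(n1, n2) / (_norm(n1) * _norm(n2))
```

Because both quantities depend only on relative atom positions, translating a residue leaves `cos_theta` unchanged, which matches the invariance stated in the text.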
To describe local features over a set of residues, a window of size w is slid along the Cα backbone of the protein. The distances and angles between the first residue in the window and each of the remaining residues are computed and appended to the feature vector; each window yields one feature vector.
Let P = {p_1, p_2, ..., p_n} represent a protein, where p_i is the i-th residue of the backbone. The feature vector set of the protein is defined as Pv = {pv_1, pv_2, ..., pv_{n-w+1}}, where w is the width of the sliding window and pv_i is the feature vector
pv_i = (d(p_i, p_{i+1}), cos θ(p_i, p_{i+1}), ..., d(p_i, p_{i+w-1}), cos θ(p_i, p_{i+w-1}))
where d(p_i, p_j) is the distance between residues i and j, and cos θ(p_i, p_j) is the cosine of the angle between them. For a window of size w, each feature vector pv_i has dimension 2(w-1).
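As a minimal sketch of the sliding-window construction above, the following Python function builds the list of feature vectors. The names `feature_vectors`, `dist` and `cos_angle` are illustrative assumptions; the two callables stand in for the residue-pair distance and cosine defined earlier.

```python
def feature_vectors(residues, w, dist, cos_angle):
    """Return [pv_1, ..., pv_{n-w+1}] for a window of width w.

    Each vector pairs the first residue of the window with every other
    residue in the window, so it has 2 * (w - 1) components.
    """
    vectors = []
    for i in range(len(residues) - w + 1):
        pv = []
        for j in range(i + 1, i + w):
            pv.append(dist(residues[i], residues[j]))
            pv.append(cos_angle(residues[i], residues[j]))
        vectors.append(tuple(pv))
    return vectors
```

A protein with n residues therefore yields n-w+1 vectors; for example, five residues and w = 3 give three vectors of four components each.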
b) Feature vector normalization
Because the feature vectors combine distances and angles, which are measured in different units, they must be normalized. Normalization also restricts the range of values taken by the components of a feature vector. The angle θ lies in [0, π], so cos θ ∈ [-1, 1]. To normalize distances, an upper bound is needed on the distance between residue i and residue (i+w-1) in a protein.
All distances and angles are normalized to integers in the range [0, b-1], where b is a given parameter.
Each distance d in a feature vector is normalized by equation (4):
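\[ d = \left\lfloor \frac{d \cdot b}{4.025 \cdot (w - 1)} \right\rfloor \tag{4} \]
where the constant 4.025 is the average distance between two Cα atoms and w is the width of the sliding window.
Each angle in a feature vector is normalized by equation (5):
\[ \cos\theta = \left\lfloor \frac{(\cos\theta + 1) \cdot b}{2} \right\rfloor \tag{5} \]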
After normalization, the protein structure is represented as a "sequence" of discrete values given by its feature vectors, in which the i-th vector describes the features of the i-th residue of the protein backbone.
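The normalization of equations (4) and (5) can be illustrated with a short Python sketch. The function names are hypothetical, and the final clamp to b-1 is an added assumption: it keeps boundary values such as cos θ = 1 inside the stated range [0, b-1], which the floor formulas alone do not guarantee.

```python
import math

CA_SPACING = 4.025  # constant from equation (4)

def normalize_distance(d, w, b):
    """Equation (4): map a distance to an integer bin."""
    return math.floor(d * b / (CA_SPACING * (w - 1)))

def normalize_cos(c, b):
    """Equation (5): map cos(theta) in [-1, 1] to an integer bin."""
    return math.floor((c + 1) * b / 2)

def normalize_vector(pv, w, b):
    """Normalize a feature vector laid out as (d, cos, d, cos, ...)."""
    out = []
    for idx, value in enumerate(pv):
        if idx % 2 == 0:
            q = normalize_distance(value, w, b)
        else:
            q = normalize_cos(value, b)
        out.append(min(q, b - 1))  # assumption: clamp into [0, b-1]
    return tuple(out)
```

With w = 3 and b = 10, the vector (3.8, 0.0, 7.6, 1.0) becomes (4, 5, 9, 9); each distinct integer tuple can then be treated as one symbol of the structure "sequence".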
c) Building the index tree
To index a set of protein structures, this paper proposes building a multi-branch tree with the algorithm shown in Figure 4.
The algorithm first reads the structure data of each protein in the database and extracts features as described above, so that the three-dimensional structure of each protein is "sequentialized" into a set of feature vectors along its backbone. After the feature vectors are normalized, each protein structure "sequence" is added to the index tree to support lookup.
Figure 4. Algorithm for building the index tree from protein structural features.
Example: build the index tree from a set of six sequentialized protein structures. Each protein sequence is represented by a set of characters, and each character corresponds to one normalized feature vector.
P1={a,b,d,f,a,h}; P2={b,a,d,b,d}; P3={a,b,c,b,d,s,f};
P4={c,a,b,a,b,c}; P5={c,a,b,c,c,b}; P6={a,c,b,a,d};
The resulting tree is shown in Figure 5.
Figure 5. Index tree based on the structural features of the proteins.
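The exact layout of the tree in Figure 5 is not recoverable from the text. One plausible reading, assumed here purely for illustration, is a prefix tree in which sequences that share leading symbols share a branch. The nested-dictionary layout and the name `build_index` are hypothetical.

```python
def build_index(sequences):
    """Build a prefix tree from {protein name: list of symbols}.

    Each node records the proteins whose sequences pass through it, so
    merged branches still identify every protein they represent.
    """
    root = {"children": {}, "proteins": set()}
    for name, seq in sequences.items():
        node = root
        for symbol in seq:
            node = node["children"].setdefault(
                symbol, {"children": {}, "proteins": set()})
            node["proteins"].add(name)
    return root

SEQUENCES = {
    "P1": list("abdfah"), "P2": list("badbd"), "P3": list("abcbdsf"),
    "P4": list("cababc"), "P5": list("cabccb"), "P6": list("acbad"),
}
index = build_index(SEQUENCES)
```

Under this assumption the root has three branches (a, b, c); P1, P3 and P6 share the branch starting with a, and P4 and P5 share the path c, a, b before they diverge.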
d) Querying the index tree
Given a query Q, the feature vectors of the structure Q are first extracted and converted into a "sequence" as described in Sections 2a and 2b. The lookup then proceeds in three phases: search, ranking and optimal selection. The search phase lists the structures in the database that match Q within a distance threshold ε between vectors. The second phase ranks all proteins that contain a matching segment. The last phase uses the Smith-Waterman algorithm [9] to find the best local structural similarity between the query Q and the selected proteins.
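The final phase can be illustrated with a compact Smith-Waterman scorer over symbol sequences. This is a sketch under assumptions: the scoring values (match +2, mismatch -1, linear gap -1) and the function name are illustrative choices, not parameters given in the paper.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,       # gap in b
                          H[i][j - 1] + gap)       # gap in a
            best = max(best, H[i][j])
    return best
```

With these scores, a query that occurs verbatim inside a candidate, such as "bcdb" inside "abcdbsf", scores 2 per matched symbol, giving 8.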
The algorithm that searches for a query pattern Q in the index tree is as follows.
Input: a protein structure segment Q; minimum match threshold ε.
Output: the set of protein structures that satisfy the search condition, sorted by decreasing number of matched residues.
Function Search(tree Root, level i, query sequence Q, threshold ε) {
    While (i < tree height - length of Q) {
        - Merge branches at level i
        - For each node at level i
            o If (node N[j] matches Q[0])
                For each child branch of N[j]: if it matches the rest
                of sequence Q within threshold ε then:
                    x Add the branch to the result set
                    x Remove the branch from the tree
                Return Search(Root, i+1, Q[0], ε);
            o Otherwise
                Return Search(N[j], i+1, Q[i+1], ε);
    } end while
} end function
Function Query(tree Root, query pattern Q, top k patterns to select, threshold ε) {
    - Initialize an empty result set
    - Extract features and build the structure sequence for query Q
    - Build the index tree
    - Search(Root, i=0, Q, ε);
    - Sort the result set by decreasing number of matches m;
    - Select the k best patterns in the result set and apply the Smith-Waterman
      algorithm to find the best local structure alignment
} end function
Example: search for the query Q={b,c,d,b} in the index tree built from the following sequentialized protein structures, with threshold ε=3. The set is P1={a,b,d,f,a,h}; P2={b,a,d,b,c};
P3={a,b,c,d,b,s,f}; P4={c,b,c,a,b,c}; P5={c,b,c,c,d,b}; P6={a,c,b,a,d}
o Query at the root level (level 0) → result set = {P2 (number of matches m=3)}
o Query at level 1 → result set = {P4 (number of matches m=3), P3 (m=4)}
o Query at level 2 → result set = {P5 (number of matches m=3)}
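The level-by-level search in this example can be traced with a small Python sketch. It makes two simplifying assumptions that are not stated in the paper: a match is counted position by position between the query and the window of the sequence that starts at the given level, and a protein is reported only at the first level where it reaches the threshold (mirroring the removal of found branches). It scans the sequences directly rather than walking a tree, and all names are hypothetical.

```python
def matches_at_level(seq, query, level):
    """Count positions where query equals the window of seq starting at level."""
    window = seq[level:level + len(query)]
    if len(window) < len(query):
        return 0
    return sum(1 for s, q in zip(window, query) if s == q)

def search(sequences, query, eps):
    """Return {level: [(protein, m), ...]} for windows with m >= eps matches."""
    found, results = set(), {}
    max_level = max(len(s) for s in sequences.values()) - len(query)
    for level in range(max_level + 1):
        hits = []
        for name, seq in sequences.items():
            if name in found:
                continue  # already reported at a lower level
            m = matches_at_level(seq, query, level)
            if m >= eps:
                hits.append((name, m))
                found.add(name)
        results[level] = hits
    return results

SEQS = {
    "P1": list("abdfah"), "P2": list("badbc"), "P3": list("abcdbsf"),
    "P4": list("cbcabc"), "P5": list("cbccdb"), "P6": list("acbad"),
}
results = search(SEQS, list("bcdb"), 3)
# results[0] == [("P2", 3)]
# results[1] == [("P3", 4), ("P4", 3)]
# results[2] == [("P5", 3)]
```

Sorting the union of these hits by decreasing m gives the ranking that the Query function passes to the Smith-Waterman step.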
3. Experimental Results
a) Sources of protein structure data
Most protein tertiary structures are stored in the Protein Data Bank (PDB) [1], the main repository for experimentally determined protein tertiary structures. The PDB was created in 1971 at Brookhaven National Laboratory (BNL) in the United States. Its structures are determined by crystallography. The PDB archive currently holds more than 73153 protein structures, and new entries are deposited every year.
The SCOP database [2], hosted at the Laboratory of Molecular Biology of the Medical Research Council (MRC) in Cambridge, England, describes the structural and evolutionary relationships among known protein structures. SCOP is widely accepted as the most suitable and most reliable classification, because its classification decisions rest on visual inspection of protein structural elements by experts. Proteins are classified hierarchically in a way that reflects their structural and evolutionary relationships. The main levels of the hierarchy are the family, based on evolutionary relationships between proteins; the superfamily, based on shared structural characteristics; and the fold, based on secondary-structure elements.
The CATH database [3], hosted at University College London (UCL), currently holds 104238 structures. It classifies proteins automatically, with expert input where the automatic methods do not give reliable results. CATH is built with SSAP, a secondary-structure comparison tool. SSAP uses a two-level dynamic programming technique to match two proteins and find their optimal structural alignment.
The FSSP database [4] was created with the DALI classification method and is hosted at the European Bioinformatics Institute (EBI). It provides an elaborate classification of protein structures. The similarity between two proteins is determined from their secondary structure. Evaluating every pair of proteins is time-consuming, so comparing one macromolecule against all macromolecules in the databases can take a whole day. For this reason a representative protein is chosen for each class, and each new protein only has to be matched against the representative of each class.
b) Storage organization
Protein tertiary structures are usually stored in formats such as MMDB ("Molecular Modeling Database", a standard format that describes peptide-bond information), mmCIF ("macromolecular Crystallographic Information File", a relational-database style format) and PDB ("Protein Data Bank", a column-oriented text format that integrates many kinds of information).
Of these formats, PDB is the most widely used. A PDB file stores the coordinates of the atoms in three-dimensional Euclidean space, together with information about the authors, the references and the experimental results of the structure determination.
Figure 6. Part of the structure of a PDB file.
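Since feature extraction only needs the backbone atoms, the relevant part of a PDB file can be read with a short parser. The sketch below relies on the fixed column layout of ATOM records in the PDB format (atom name in columns 13-16, chain identifier in column 22, residue number in columns 23-26, coordinates in columns 31-54). The function names, the `atom_line` helper and the sample coordinates are illustrative assumptions, not data from the paper.

```python
def parse_backbone(pdb_text):
    """Return residues in file order as {"N": xyz, "CA": xyz, "C": xyz} dicts."""
    residues, order = {}, []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom = line[12:16].strip()
        if atom not in ("N", "CA", "C"):
            continue
        key = (line[21], line[22:27])  # chain ID, residue number + insertion code
        if key not in residues:
            residues[key] = {}
            order.append(key)
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residues[key].setdefault(atom, xyz)  # keep the first alternate location
    # Skip residues that lack one of the three backbone atoms.
    return [residues[k] for k in order if len(residues[k]) == 3]

def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Hypothetical helper: format one ATOM record for the demo below."""
    return "ATOM  {:5d} {:<4s} {:>3s} {}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)

# Made-up coordinates, used only to exercise the parser.
demo = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    atom_line(3, " C  ", "MET", "A", 1, 26.913, 26.639, 3.531),
    atom_line(4, " O  ", "MET", "A", 1, 27.886, 26.463, 4.263),
])
backbone = parse_backbone(demo)
```

Each dictionary in the returned list holds exactly the three atoms needed to compute the residue-pair distance and plane angle used for the feature vectors.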
The authors collected published structures in PDB format from the sources described above and stored them in a relational database to make indexing and lookup convenient. The proposed relational database model is shown in Figure 7.
Figure 7. Relational database model for storing protein structure information.
c) Experimental results
Some experimental results are presented below. The test data set D was extracted from the SCOP database [2] and contains proteins from all four classes: all-α, all-β, α+β and α/β. The data set takes a fixed number of proteins from each of the selected SCOP superfamilies. Query samples are drawn at random from D in each experiment. The experiments use five parameters: w is the window width, b is the normalization value, ε is the minimum distance threshold between two vectors, l is the minimum length required of the longest matching segment, and k is the number of top-scoring proteins retained. The algorithm was implemented in C++ and tested under Windows on a machine with a dual 1.6 GHz CPU and 2 GB of RAM. The number of proteins shown in each chart is the average number of proteins found within the superfamily over the experiments.
Figure 8. Number of proteins found in the same superfamily versus the cutoff k (w=3, b=10, ε=3 and l=10).
Figure 9. Number of proteins found in the same superfamily versus the window size w (b=10, ε=3 and l=15).
Figure 10. Number of proteins found in the same superfamily versus the distance threshold ε (w=3, b=10 and l=15).
Figure 11. Number of proteins found in the same superfamily versus the normalization value b (w=3, ε=2.5 and l=15).
d) Evaluation and remarks
Figure 8 shows the average number of proteins found in the same superfamily as the cutoff k varies; the retrieval quality is close to that of PSIST. Figure 9 shows that the algorithm is stable for small window sizes, starting at w = 3; beyond this range, quality drops noticeably because of errors introduced during feature extraction and vector normalization. This can be mitigated by increasing the normalization value b, as Figure 11 shows, but at the cost of longer processing time and more storage for the feature vectors.
The results show that the performance of the algorithm is close to that of PSIST and somewhat better than that of ProGreSS. In terms of storage, PSIST needs more space for its suffix tree on large data sets and its search procedure is more complex, but it is more accurate than the algorithm proposed here.
The proposed algorithm has the following strengths:
o The index tree is built once and adjusted many times during search. Searching for a sequence Q of length l in an index tree of height h has complexity O(k*(h-l)*b), where k is the average number of branches with equal values at level i and b is the number of branches at the root.
o Merging branches when the tree is adjusted allows several structures that satisfy the query to be found at once; a branch that has been found is removed from the tree to reduce the search space at higher levels.
o The algorithm searches over the entire structure data space.
4. Conclusion
This paper presented an approach to indexing a database of protein tertiary structures, based on PSIST-style feature extraction, and proposed a search algorithm over the resulting index tree. It also surveyed the sources of protein tertiary structure data and proposed a database model for storing that data in support of indexing and lookup. The test data were extracted from SCOP superfamilies, and the results show that the proposed algorithms are relatively accurate and effective on the test data.
References
[1] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne, "The Protein Data Bank," Nucleic Acids Research, vol. 28, 2000, pp. 235-242.
[2] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia, "SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures," J. Mol. Biol., vol. 247, 1995, pp. 536-540.
[3] C.A. Orengo, A.D. Michie, D.T. Jones, M.B. Swindells, and J.M. Thornton, "CATH - A Hierarchic Classification of Protein Domain Structures," Structure, vol. 5, no. 8, 1997, pp. 1093-1108.
[4] L. Holm and C. Sander, "The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins," Nucleic Acids Research, vol. 24, 1996, pp. 206-210.
[5] A. Bhattacharya, T. Can, T. Kahveci, A.K. Singh, and Y.F. Wang, "ProGreSS: Simultaneous searching of protein databases by sequence and structure," Pacific Symp. Bioinformatics, pp. 264-275, 2004.
[6] T. Can and Y. Wang, "CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features," IEEE Computer Society Bioinformatics Conference (CSB), pp. 169-179, 2003.
[7] F. Gao and M.J. Zaki, "PSIST: Indexing Protein Structures using Suffix Trees," in IEEE Computational Systems Bioinformatics Conference, Palo Alto, CA, August 2005.
[8] A. Salah, T.F. Gharib, and A.-B.M. Salem, "PSISA: an Algorithm for Indexing and Searching Protein Structure using Suffix Arrays," in The WSEAS International Conference on Computers, pp. 775-780, 2008.
[9] T.F. Smith and M.S. Waterman, "Identification of common molecular subsequences," J. Mol. Biol., vol. 147, pp. 195-197, 1981.