You are on page 1of 22

Thut Ton Cy Quyt nh C4.

5
Sinh vin: Lu Cng T Ngi hng dn: V Tin Thnh

Outline
Thut ton Cy quyt nh
nh ngha. Xy dng cy quyt nh. c im cy quyt nh.

Thut ton C4.5


Lch s Pseudocode, cc cng thc v v d. C ch chng qu va d liu, chng thiu d liu thuc tnh. Chuyn sang lut ng dng. C4.5 v See5/C5.0 Hng pht trin.
2

Thut ton cy quyt nh


nh ngha: Cy quyt nh l biu quyt nh pht trin c cu trc dng cy:
Gc: Node trn cng cy. Node trong: biu din 1 kim tra hoc 1 thuc tnh n Node l: biu din lp. Nhnh: Kt qu kim tra ca node trn
Gc

Node L

Node Trong Nhnh

Node L

Node L

V d cy quyt nh
4

Thut ton cy quyt nh


Xy dng cy quyt nh gm 2 bc:
Pht trin cy quyt nh: i t gc, n cc nhnh, pht trin quy np theo hnh thc chia tr.
Chn thuc tnh tt nht bng mt o nh trc Pht trin cy bng vic thm cc nhnh tng ng vi tng gi tr ca thuc tnh chn Sp xp, phn chia tp d liu o to ti node con Nu cc v d c phn lp r rng th dng. Ngc li: lp li bc 1 ti bc 4 cho tng node con

Ct ta cy: nhm n gin ha, khi qut ha cy, tng chnh xc


5

Thut ton cy quyt nh


VD: thut ton Hunt s dng trong C4.5, CDP...
S={S1,S2,,Sn} l tp d liu o to C={C1,C2,,Cm} l tp cc lp TH1: Si (i=1n) thuc v Cj => Cy quyt nh l 1 l ng Cj. TH2: S thuc v nhiu lp trong C.
Chn 1 test trn thuc tnh n c nhiu gi tr O={O1,..Ok} (k thng bng 2). Test t gc ca cy, mi Oi to thnh 1 nhnh, chia S thnh cc tp con c gi tr thuc tnh = Oi. quy cho tng tp con => cy quyt nh gm nhiu nhnh, mi nhnh tng ng vi Oi.

Thut ton cy quyt nh


im mnh ca cy quyt nh:
Sinh ra cc quy tc hiu c: chuyn i c sang ting Anh hoc SQL. Thc thi trong lnh vc hng quy tc. D dng tnh ton trong khi phn lp. X l vi thuc tnh lin tc v ri rc. Th hin r rng nhng thuc tnh tt nht: phn chia d liu t gc.

im yu ca cy quyt nh:
D xy ra li khi c nhiu lp: do ch thao tc vi cc lp c gi tr dng nh phn. Chi ph tnh ton t hc: do phi i qua nhiu node n node l cui cng

Thut ton C4.5


L s pht trin t CLS v ID3. ID3 (Quinlan, 1979)- 1 h thng n gin ban u cha khong 600 dng lnh Pascal Nm 1993, J. Ross Quinlan pht trin thnh C4.5 vi 9000 dng lnh C. Hin ti: phin bn See5/C5.0. T tng thut ton: Hunt, chin lc pht trin theo su.

Thut ton C4.5


Pseudocode:
Kim tra case c bn Vi mi thuc tnh A tm thng tin nh vic tch thuc tnh A Chn a_best l thuc tnh m o la chn thuc tnh tt nht Dng a_best lm thuc tnh cho node chia ct cy. quy trn cc danh sch ph c to ra bi vic phn chia theo a_best, v thm cc node ny nh l con ca node
(1) ComputerClassFrequency(T); (2) if OneClass or FewCases return a leaf; Create a decision node N; (3) ForEach Attribute A ComputeGain(A); (4) N.test=AttributeWithBestGain; (5) if (N.test is continuous) find Threshold; (6) ForEach T' in the splitting of T (7) If ( T' is Empty ) Child of N is a leaf else (8) Child of N=FormTree(T'); (9) ComputeErrors of N; return N
9

Thut ton C4.5


o la chn thuc tnh tt nht: information gain v gain ratio
Tn sut cc case Sj thuc v gi tr phn lp Cj |Sj| RF (Cj, S) = |S| Ch s thng tin cn thit cho s phn lp: I(S)

S={S1,S2,,Sx)
10

Thut ton C4.5


o la chn thuc tnh tt nht:
Information gain: Test B chia S={S ,S ,,S )
1 2 t

Test B s c chn nu c G(S, B) t gi tr ln nht. Thng tin tim nng (potential information) ca bn thn mi phn hoch:

Gain ratio = G(S, B) / P(S, B) lnnht => chn test B

11

Thut ton C4.5


o o o o o V d: s1 (yes) 9 case,s2 (no) 5 case I(S) = I(s1,s2) = I(9, 5) = 0.940 A = age =>S={S1,S2,S3} S1 (age<30), S2(30-40), S3 (>30). I (S1): s11(yes &<30) =2, s22(no&<30) =3 o I (S1) = (s11,s21) =0,971 o Gain (S, age) = =I(s1,s2) |Si| / |S|* I(Si) = 0.246 o A = income: Gain (S, A) = 0.029 o A = student: Gain (S, A) = 0.151 o A = credit_rating: Gain (S, A) = 0.048 o A= age c Gain ln nht =>chn lm thuc tnh pht trin ti node ang xt

12

Thut ton C4.5


X l qu va d liu:
Cho php cy qu va d liu, sau ct ta cy.

X l nhng gi tr thiu:
L case c thuc tnh khng c gi tr. information gain v potential information: S0 = case thuc S c gi tr thuc tnh = null.

Nu test B c chn, C4.5 phn chia cc case trong S0 v cc tp con Si

13

Thut ton C4.5


Chuyn i sang lut: ct ta cy
Dng lut: if A and B and C then class X. Khng tha mn iu kin chuyn v lp mc nh. Xy dng lut: 4 bc
Mi ng i t gc n l l mt lut mu. n gin lut mu bng cch b dn iu kin m khng nh hng ti chnh xc ca lut. Cc lut ct ta c nhm li theo gi tr phn lp to ra cc tp con. Vi mi tp con, xem xt la chn lut ti u ha chnh xc d on ca lp gn vi tp lut . Sp xp cc tp lut trn theo tn s li. Lp mc nh c to ra bng cch xc nh cc case trong tp S khng cha trong cc lut hin ti v chn lp ph bin nht trong cc case lm lp mc nh. c lng nh gi: cc lut c c lng trn ton tp S, loi b lut lm gim chnh xc ca s phn lp.

Hon thnh: 1 tp cc quy tc n gin c la chn cho mi lp


14

Thut ton C4.5


c im C4.5:
Chim thi gian s dng CPU v b nh ln:
vd vi 10k ti 100k case, to cy quyt nh tng t 1,4s ln 61s, to lut tng t 32s ln 9,715s.

S dng c ch lu d liu thng tr trong b nh => ng dng vi database nh ( tn s li lp li 4% vi database 20000 cases). C c ch x l thiu, li hoc qu va d liu. Lut to ra n gin.
15

Thut ton C4.5


ng dng vo bi ton phn lp d liu: Bc 1 (Hc): xy dng m hnh m t tp d liu; khi nim bit
Input: tp d liu c cu trc c to m t bng cc thuc tnh Output: Cc lut IfThen

Bc 2 (Phn loi): da trn m hnh xy dng phn lp d liu mi: i t gc n cc nt l nhm rt ra lp ca i tng cn xt.
16

Thut ton C4.5


ng dng vo bi ton phn lp d liu:
X l vi d liu thuc tnh lin tc:
S dng kim tra dng nh phn: value(V) < h vi h l hng s ngng (threshold) h c tm bng cch: Quick sort sp xp cc case trong S theo cc gi tr ca thuc tnh lin tc V ang xt =>V = {v1, v2, , vm} hi = (vi + v(i+1))/2. Test phn chia d liu:V <= hi hay V>hi => chia V thnh V1={v1,v2,, vi} v V2 = {vi+1, vi+2, , vm} v c hi (i=1m-1) Tnh Information gain hay Gain ratio vi tng hi. Ngng c gi tr ca Information gain hay Gain ratio ln nht s c chn lm ngng phn chia ca thuc tnh .

17

C4.5 v C5.0
Lut: C5.0 chnh xc hn, nhanh hn, tn t b nh hn.
Blue: C5.0

Cy quyt nh: nhanh hn, nh hn

18

C4.5 v C5.0
Boost: to v kt hp nhiu lp phn loi tng chnh xc d on
Blue: C5.0

Kiu d liu mi: vd ngy,thng

19

Hng nghin cu
Cy n nh:
Tn s li ca cy c xy dng t data case c cu trc thp hn nhiu so vi data case khng nhn thy c. VD: vi 20k case c cu trc, t l li l 4%,
Cng 20k case v c 1 case khng c kim tra, t l li 11,7%

Yu cu t ra: xy dng cy m t l li l xp x nhau cho c 2 trng hp.

Phn ly cy phc tp:


C th chia ct cy phc tp thnh cc cy nh, n gin m kt qu khng i ?
20

Ti liu tham kho


Nguyn Th Thy Linh (2005). Thut ton phn lp cy quyt nh, Kha lun tt nghip i hc, Trng i hc Cng ngh, 2005. [WKQ08] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu , Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg (2008). Top 10 algorithms in data mining, Knowl Inf Syst (2008) 14:137 http://rulequest.com/see5-comparison.html http://en.wikipedia.org/wiki/ID3_algorithm http://en.wikipedia.org/wiki/C4.5_algorithm http://en.wikipedia.org/wiki/Decision_tree
21

Cm n thy, anh ch v cc bn theo di!

22

You might also like