
Big Data Architecture - the development model of the future
Big Data is now an extremely common term. It is also a new development trend in the information technology industry and is present in a great many different fields. This popularity goes hand in hand with the dizzying growth of the enormous volume of data being used all over the world.
Any discussion of Big Data has to mention Apache Hadoop, a framework that helps developers build distributed applications and systems. Hadoop began as highly reliable data storage combined with batch processing based on MapReduce, a scalable framework for parallel computation. It has since gained components for real-time processing, text processing, and search, such as Impala and Apache Solr.
Building a system with Hadoop is also getting easier: simply installing CDH makes every component of the Hadoop ecosystem available. How exactly such a system should be designed and built, however, remains a big question for technology experts. With such a rich variety of applications being built on Big Data, one of the topics that attracts the most interest is Big Data architecture.
Big Data Architecture
Big Data architecture rests on a specific set of skills for developing a data processing pipeline that is reliable, scalable, and automated. Acquiring those skills requires a working knowledge of every component of the system, from designing the hardware cluster to configuring the entire Hadoop processing flow. The diagram below gives an overview of such a system:
The diagram above shows that the main processing flow of the system takes raw data as input and returns valuable data. Throughout that process, Big Data engineers are the ones who choose the technologies used inside it, how the data is stored and how it is accessed internally and externally, and which tools process the data. In other words, Big Data engineers are the people who design and implement the Big Data architecture.
Next, we will go through each main part of the system above to see what role it plays in building the data processing pipeline.
Building the hardware cluster
Building the hardware cluster is a complex problem, because the design is normally done after the requirements have been identified, and the requirements are usually unclear at the start. Most vendors provide specific guidance on choosing suitable hardware. Typically, each machine in a recommended cluster has 2 CPUs with 4 to 8 cores per CPU, between 48GB and 512GB of RAM for holding temporary data, and 6 to 12 hard disks for storing data and configuration. If building the hardware cluster is still difficult, it is always possible to prototype the system on cloud services first, until the requirements are clearly understood.
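As a rough illustration, the recommended ranges above can be written down as data and used to check a candidate machine. This is only a sketch: the `NodeSpec` type and the `out_of_range` helper are hypothetical names introduced here, and the ranges are the ones quoted in the text, not vendor guidance.

```python
from dataclasses import dataclass
from typing import List

# Recommended per-machine ranges as quoted in the text (inclusive bounds).
RECOMMENDED = {
    "cpus": (2, 2),
    "cores_per_cpu": (4, 8),
    "ram_gb": (48, 512),
    "disks": (6, 12),
}


@dataclass
class NodeSpec:
    """Hypothetical description of one machine in the cluster."""
    cpus: int
    cores_per_cpu: int
    ram_gb: int
    disks: int


def out_of_range(spec: NodeSpec) -> List[str]:
    """Return one message per field that falls outside the recommended range."""
    problems = []
    for field, (low, high) in RECOMMENDED.items():
        value = getattr(spec, field)
        if not low <= value <= high:
            problems.append(f"{field}={value} (recommended {low}-{high})")
    return problems


candidate = NodeSpec(cpus=2, cores_per_cpu=6, ram_gb=32, disks=8)
print(out_of_range(candidate))
```

A check like this is mostly useful early on, when several hardware or cloud instance options are being compared against still-changing requirements.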
Ingestion
Once a hardware cluster that meets the requirements is in place, the next decision is how data will be brought into it. In practice there are two main approaches: batch ingest and event ingest. The first suits file-based and structured data, while the second suits data that requires real-time processing, such as transaction data or logging.
Batch ingest
When transferring data from a structured source such as an RDBMS, the usual choice is Apache Sqoop. Sqoop handles moving data from an RDBMS into Hadoop very well, whether part of the data or all of it. Sqoop runs on the MapReduce framework and integrates the JDBC components of many popular database systems.
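The difference between a full and a partial transfer can be sketched by assembling a `sqoop import` command line. The flags used (`--connect`, `--table`, `--target-dir`, `--num-mappers`, `--where`) are standard Sqoop import arguments; the `build_sqoop_import` helper, the JDBC URL, the table name, and the paths are assumptions made up for this example.

```python
import shlex
from typing import List, Optional


def build_sqoop_import(jdbc_url: str, table: str, target_dir: str,
                       where: Optional[str] = None,
                       num_mappers: int = 4) -> List[str]:
    """Build the argument list for a `sqoop import` run (hypothetical helper)."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--table", table,                  # table to copy
        "--target-dir", target_dir,        # destination directory in HDFS
        "--num-mappers", str(num_mappers), # number of parallel map tasks
    ]
    if where is not None:
        # A WHERE clause limits the import to part of the table.
        args += ["--where", where]
    return args


# Placeholder connection details, not taken from the article.
full = build_sqoop_import("jdbc:mysql://db.example.com/shop", "orders",
                          "/data/raw/orders")
partial = build_sqoop_import("jdbc:mysql://db.example.com/shop", "orders",
                             "/data/raw/orders_2015",
                             where="order_date >= '2015-01-01'")
print(shlex.join(full))
print(shlex.join(partial))
```

The resulting list could be passed to `subprocess.run` on a machine where Sqoop is installed; credentials are deliberately left out of the sketch.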
Another, more complex batch ingest approach is to use files. There are quite a few ways to do this, but almost no one uses them. The complexity lies in where the files are stored and in the API needed to load them. In practice, people usually turn to event ingest to avoid having to load such a large number of files.
Event ingest
For event ingest, Apache Flume is a good tool. It provides agents that move event data from one system to another, which may be HDFS, Spark, or HBase. Flume has been thoroughly tested on large hardware clusters and has proven very reliable. The difficult part of Flume is configuring the agents and the overall Flume topology correctly.
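To give a sense of what configuring an agent involves, the sketch below renders a minimal Flume agent definition with one source, one channel, and one HDFS sink. The property keys follow Flume's standard properties-file layout; the `render_flume_agent` helper, the component names `r1`, `c1`, `k1`, the port, and the HDFS path are illustrative assumptions.

```python
def render_flume_agent(agent: str, port: int, hdfs_path: str) -> str:
    """Render a minimal Flume agent config: netcat source -> memory channel -> HDFS sink."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        # Source: listens for newline-terminated events on a TCP port.
        f"{agent}.sources.r1.type = netcat",
        f"{agent}.sources.r1.bind = 0.0.0.0",
        f"{agent}.sources.r1.port = {port}",
        # Channel: buffers events in memory between source and sink.
        f"{agent}.channels.c1.type = memory",
        # Sink: writes events to HDFS.
        f"{agent}.sinks.k1.type = hdfs",
        f"{agent}.sinks.k1.hdfs.path = {hdfs_path}",
        # Wiring: a source may feed several channels, a sink reads from exactly one.
        f"{agent}.sources.r1.channels = c1",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


config = render_flume_agent("a1", 44444, "hdfs://namenode/flume/events")
print(config)
```

Most of the configuration effort in a real deployment goes into the wiring lines at the end and into choosing channel types and capacities, which this sketch leaves at their defaults.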
Storage
Once the data has arrived, there is still one concern to address before thinking about processing: storage. This is not only a question of where the data is stored. The data also needs a suitable format, a suitable size, and suitable access permissions.
Storage format
Which format is suitable depends on whether the application does batch processing or real-time processing. For batch processing, file formats such as SequenceFile and Avro are both popular and appropriate. For real-time applications, a name that has emerged recently is Apache Parquet. Its structure resembles that of table-oriented databases, yet it allows very efficient access to and processing of large data sets.
It is also worth considering how the data will be handled after a certain period of time. There should be a specific mechanism for moving old data to another location, or into another format that takes less space, to avoid wasting system resources.
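One possible form of such a mechanism is a periodic job that picks out data sets not used for longer than some threshold. The sketch below shows only the selection step; the `select_for_archive` helper, the 90-day threshold, and the paths are assumptions for illustration, and the actual move or re-encoding is left out.

```python
from datetime import date, timedelta
from typing import Dict, List


def select_for_archive(last_used: Dict[str, date], today: date,
                       max_age_days: int) -> List[str]:
    """Return the paths whose last use is older than max_age_days, sorted by name."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, used in last_used.items() if used < cutoff)


# Hypothetical data sets with the date each one was last read.
datasets = {
    "/data/events/2014": date(2014, 12, 31),
    "/data/events/2015": date(2015, 6, 1),
}
to_archive = select_for_archive(datasets, today=date(2015, 7, 1), max_age_days=90)
print(to_archive)  # ['/data/events/2014']
```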
Data partitioning
The next question is how the data will be partitioned and at what size. Hadoop is very good at managing a small number of large files. A system designed so that it produces many small pieces of data in HDFS will therefore only slow down the NameNode. This is still a workable approach, of course, but it requires an additional intermediate step that merges the small pieces of data together.
In HBase, partitioning happens implicitly: data is divided into adjacent rows in a table, sorted by a predefined key. In HDFS, partitioning has to be planned in advance, so you will need to analyze the structure of a data sample to decide which partitioning scheme is most appropriate. Experience shows that it is best to avoid creating small files, for the reason mentioned above. Specifically, each file should be at least 1GB in size, or even larger, depending on the data set being partitioned.
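The intermediate merging step can be planned with a simple greedy grouping: keep adding small files to a group until the group reaches the minimum size. This is a sketch of the planning logic only, under the assumption that file sizes are known up front; `plan_compaction` and the file names are hypothetical, and the merge itself is not shown.

```python
from typing import List, Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3


def plan_compaction(files: List[Tuple[str, int]],
                    min_bytes: int = GIB) -> List[List[str]]:
    """Group (name, size) pairs so that each merged output reaches min_bytes."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= min_bytes:
            groups.append(current)
            current, current_size = [], 0
    if current:
        # Leftovers: fold them into the last group instead of emitting a small file.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


small_files = [("a", 400 * MIB), ("b", 700 * MIB), ("c", 300 * MIB),
               ("d", 900 * MIB), ("e", 100 * MIB)]
print(plan_compaction(small_files))  # [['a', 'b'], ['c', 'd', 'e']]
```

Each inner list is one merged output file. If the total input is smaller than the minimum, everything ends up in a single group, which is still better for the NameNode than many tiny files.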
Access control
The last storage concern is organizing the data so that many different processes can access it without affecting one another or compromising the safety of the data.
This is not as simple as having each process read from one directory and write to another. If access to the data is shared among several stakeholders, you need to define a strict access control scheme that governs who can access which data. One way to address the problem is to create timestamped directories for each individual process. This ensures that processes, even when they run in parallel, do not affect each other's data.
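The timestamped-directory idea can be sketched as follows. The example uses the local filesystem through `pathlib` as a stand-in for HDFS, and the `make_run_dir` helper, the process name, and the timestamp layout are assumptions chosen for illustration.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def make_run_dir(base: Path, process_name: str, started: datetime) -> Path:
    """Create <base>/<process_name>/<timestamp>/ for one run of one process."""
    stamp = started.strftime("%Y%m%dT%H%M%S%f")  # microsecond resolution
    run_dir = base / process_name / stamp
    # exist_ok=False: fail loudly if another run already owns this directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


base = Path(tempfile.mkdtemp())  # local stand-in for an HDFS base path
first = make_run_dir(base, "sessionize",
                     datetime(2015, 7, 1, 12, 0, 0, tzinfo=timezone.utc))
second = make_run_dir(base, "sessionize",
                      datetime(2015, 7, 1, 12, 0, 1, tzinfo=timezone.utc))
print(first.name, second.name)
```

Because every run writes only under its own directory, two runs of the same process started a second apart never touch each other's output. Controlling who may read which directory is a separate matter and would still need permissions on top of this layout.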
Data processing
Once the data has been stored, the next step in the overall process is to process it automatically.
Data transformation
Transforming data here does not mean losing some part of it. It is the process that helps the system handle the data more efficiently. One example is checking how the data is processed and how often, in order to reduce the number of times the data has to be rewritten.
Data analysis
The most talked-about topic in Big Data analysis is machine learning: building mathematical models that can then solve recommendation, clustering, or classification problems on new data. Examples include risk assessment, error detection, or something as simple as spam filtering. However, analysis processes other than machine learning, such as building correlations or reporting on data, are still very common.
Whichever method is used, in the end both the transformation and the analysis processes need to run automatically.
Data flow
Before data can be processed automatically, the tools and processes in each component of the system need to be combined into one unified data flow. There are two kinds of data flow: micro and macro.
A micro data flow carries out individual small parts of the larger overall process. Tools that serve this purpose include Morphlines, Crunch, and Cascading.

Morphlines applies a chain of small operations to each record as it is processed, while Crunch and Cascading define an abstraction layer over each small region of data. However, processes in Crunch or Cascading cannot yet be combined with one another to form a more complex flow of processes, something that is easy to do with a macro data flow.
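The per-record style of a micro data flow can be illustrated with a chain of small functions applied to each record in turn. This is a plain Python analogy, not the Morphlines API: the command functions, the record layout, and `run_chain` are all invented for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Record = Dict[str, str]
# A command transforms a record, or returns None to drop it.
Command = Callable[[Record], Optional[Record]]


def trim_fields(record: Record) -> Optional[Record]:
    """Strip surrounding whitespace from every field."""
    return {key: value.strip() for key, value in record.items()}


def drop_empty_user(record: Record) -> Optional[Record]:
    """Discard records that have no user."""
    return record if record.get("user") else None


def lowercase_user(record: Record) -> Optional[Record]:
    """Normalize the user name to lower case."""
    return {**record, "user": record["user"].lower()}


def run_chain(records: Iterable[Record],
              commands: List[Command]) -> Iterator[Record]:
    """Apply each command in order to each record; stop early if one drops it."""
    for record in records:
        current: Optional[Record] = record
        for command in commands:
            current = command(current)
            if current is None:
                break
        if current is not None:
            yield current


raw = [{"user": " Alice ", "action": "login"}, {"user": "  ", "action": "ping"}]
cleaned = list(run_chain(raw, [trim_fields, drop_empty_user, lowercase_user]))
print(cleaned)  # [{'user': 'alice', 'action': 'login'}]
```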
Apache Oozie is one tool for building macro data flows. It can define a workflow in which sub-processes run in parallel or are combined in very flexible ways. Oozie also provides a server component that monitors the processes that are running or scheduled to run and collects statistics about them. A single process, or a micro data flow, is only defined; it cannot be executed automatically, and launching it has to be done by hand. With Oozie, by contrast, workflows can be executed automatically at a specific time and with a specific frequency, as defined by its coordinators.
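The two ideas in this paragraph, sub-processes that may run in parallel and runs triggered at a fixed frequency, can be sketched in plain Python. This is not Oozie's workflow or coordinator format (those are XML definitions); the step names, `execution_waves`, and `scheduled_runs` are assumptions used only to illustrate the concepts. The sketch relies on `graphlib`, available from Python 3.9.

```python
from datetime import datetime, timedelta
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def execution_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Order steps into waves; steps within one wave have no mutual dependencies."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


def scheduled_runs(start: datetime, frequency: timedelta,
                   count: int) -> List[datetime]:
    """Return the first `count` trigger times for a fixed-frequency schedule."""
    return [start + i * frequency for i in range(count)]


# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "report": {"aggregate", "train_model"},
}
print(execution_waves(workflow))
print(scheduled_runs(datetime(2015, 7, 1, 2, 0), timedelta(days=1), 3))
```

Here `aggregate` and `train_model` land in the same wave, which corresponds to a fork in the workflow, and `report` waits for both, which corresponds to a join.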
The diagram below summarizes some of the tools and concepts that can be used in each component of the data flow discussed above:
Hadoop's rich ecosystem already provides most of the tools needed to build and automate systems that process large data flows. These tools also give strong support during testing and deployment, which helps bring products to market. Although Hadoop keeps growing and maturing, the potential of the framework has not yet been fully explored. Experienced Big Data engineers are therefore an extremely important and necessary factor in pushing Hadoop forward in the Big Data field. This is also why the technology experts at FPT Software are investing more in deeper research into Hadoop and in applying it to real projects.
For more information, please visit: The-meanings-of-big-data-engineer-and-big-dataarchitecture
V Thanh Hi
FSB, FPT Software
