
Ton Duc Thang University

Machine Learning

Report on Job Salary Prediction

written in LaTeX

Members:
Tran ng Trinh - 5100324
Nguyen Thi My Dung - 51003238

Supervisor:
Dr. Nguyen Thanh Hien

Monday, 09/05/2016
Contents

1 INTRODUCTION TO THE R LANGUAGE
1.1 Concepts
1.1.1 Introduction
1.1.2 What is R?
1.2 Downloading and Installing R
1.3 R Language Syntax
1.3.1 Naming in R
1.3.2 Getting Help in R

2 INTRODUCTION TO KAGGLE

3 INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION

4 JOB SALARY PREDICTION IN R
4.1 Reading the Data
4.2 Building Top Sources
4.2.1 Training set
4.2.2 Test set (same procedure as for the training set)
4.3 Building Top Locations
4.3.1 Training set
4.3.2 Test set
4.4 Building the Model
Chapter 1

INTRODUCTION TO THE R LANGUAGE

1.1 Concepts


1.1.1 Introduction
- Data analysis and charting are usually carried out with common software packages such as SAS, SPSS, Stata,
Statistica, and S-Plus. These packages were developed by software companies and brought to market over
roughly the past three decades, and they are used for teaching and research by universities, research centers,
and industrial companies around the world. However, because licensing them is fairly expensive (sometimes
hundreds of thousands of dollars per year), some universities in developing countries (and even in some
developed countries) cannot afford to use them over the long term. Statistical researchers around the world
therefore collaborated to develop a new, open-source software package that everyone in the statistics and
mathematics communities could use in a uniform way and completely free of charge.
- In 1996, in an important paper on statistical computing, the statisticians Ross Ihaka and Robert
Gentleman of the University of Auckland, New Zealand, outlined a new language for statistical analysis,
which they named R. The initiative was endorsed by many statisticians worldwide, who joined in the
development of R.
- In the years since, more and more statisticians, mathematicians, and researchers in every field have
switched to R for analyzing scientific data. Worldwide there is a network of more than one million R users,
and the number is growing rapidly.

1.1.2 What is R?
In short, R is software for statistical analysis and plotting. More fundamentally, R is a general-purpose
computer language that can be used for many different purposes, from simple calculations and
recreational mathematics to matrix computations and complex statistical analyses. Because it is a
language, R can also be used to develop specialized software for a particular computational problem.
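
As a small illustration of this range, the sketch below runs a simple calculation, a matrix product, and two summary statistics. The variable names and numbers are made up for the example.

```r
# Simple arithmetic
total <- 2 + 3 * 4

# Matrix computation: multiply a 2x2 matrix by its transpose
m <- matrix(c(1, 2, 3, 4), nrow = 2)
mm <- m %*% t(m)

# A small statistical analysis: mean and standard deviation
x <- c(10, 12, 9, 11, 13)
x.mean <- mean(x)
x.sd <- sd(x)
```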

1.2 Downloading and Installing R
- To use R, we must first install it on our computer. To do so, we go online and visit the website of
the Comprehensive R Archive Network (CRAN):

https://cran.r-project.org/

- Once R has been downloaded, the next step is to install (set up) it on the computer. To do this, we
simply click the downloaded file and follow the installation instructions on screen. This is a very
simple step; the installation takes only about a minute.

1.3 R Language Syntax
- The basic unit of R syntax is a command, or function. A function takes arguments, so the function
name is followed by the arguments we must supply:

object <- function(argument 1, argument 2, ..., argument n)

- To find out which arguments a function takes, we use the command args(x) (args is short for
arguments), where x is the function we want to inspect.
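
A minimal sketch of this pattern, using the base functions seq() and sd() as arbitrary examples (they are not taken from the report):

```r
# object <- function(argument 1, argument 2, ..., argument n)
x <- seq(from = 1, to = 10, by = 3)

# args() prints the arguments that a function expects
args(sd)

# formals() returns the same information as a named list
sd.args <- names(formals(sd))
```

Here x holds the sequence 1, 4, 7, 10, and sd.args holds the two argument names of sd().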

- Some operators commonly used in R are:

x == 5 : x equals 5
x != 5 : x does not equal 5
y < x : y is less than x
x > y : x is greater than y
z <= 7 : z is less than or equal to 7
p >= 1 : p is greater than or equal to 1
is.na(x) : is x a missing value?
A & B : A and B (AND)
A | B : A or B (OR)
! : negation (NOT)

- In R, any text or command that follows the # symbol has no effect, because # is reserved for comments
added by the user.
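
The operators and the comment symbol can be tried out as follows. The values assigned to x, y, and z are arbitrary.

```r
x <- 5
y <- 3
z <- c(4, NA, 9)

is.equal   <- x == 5              # TRUE
not.equal  <- x != 5              # FALSE
y.smaller  <- y < x               # TRUE
missing    <- is.na(z)            # FALSE TRUE FALSE
both       <- (x > 4) & (y > 4)   # FALSE: only x is greater than 4
either     <- (x > 4) | (y > 4)   # TRUE
negated    <- !(x == 5)           # FALSE

# Everything after '#' on a line is ignored by R
n.missing <- sum(is.na(z))        # number of missing values in z
```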

1.3.1 Naming in R


- Naming an object or a variable in R is quite flexible, because R imposes fewer restrictions than
other software. An object name must be written as one word (that is, it must not contain a space).
For example, R accepts myobject but not my object.
- A name such as myobject can be hard to read, so it is better to separate the words with a dot, as in my.object.

- An important point to remember is that R is case sensitive, so My.object is different
from my.object.
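
A short sketch of these naming rules. The use of parse() with try() to show that a name containing a space is rejected is our own illustration, not part of the original report.

```r
my.object <- 1
My.object <- 2   # a different object, because R is case sensitive
are.different <- my.object != My.object

# A name containing a space is not valid syntax: parsing it raises an error
result <- try(parse(text = "my object <- 1"), silent = TRUE)
space.rejected <- inherits(result, "try-error")
```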

1.3.2 Getting Help in R
Besides args(), R provides the help() command so that users can look up the syntax of each function. For
example, to find out which arguments the function lm takes, we simply type:
> help(lm) or > ?lm

Chapter 2

INTRODUCTION TO KAGGLE

- Founded in 2010, Kaggle is an online platform for hosting data-mining and predictive-modeling
competitions. A company can work with Kaggle to post a data set together with a problem statement
so that the site's community of data scientists can propose solutions.

- An important point is that "contestants" may revise their solutions repeatedly, which pushes both them
and the community to keep searching for better solutions right up to the deadline.
- Companies of all kinds, such as MasterCard, Pfizer, Allstate, Facebook, and even NASA, have hosted
competitions on Kaggle. For example, General Electric sponsored a competition to write software that plans
more efficient flight routes for airlines, and Practice Fusion (a health-technology company) sponsored
another to identify patients with type 2 diabetes from their medical records.
- Prizes for winning solutions range from 3,000 to 250,000 USD. One exceptional prize, worth up to
3 million USD, was offered by Heritage Provider Network.
- Everyone has a chance. Any contestant, however remote, can measure their talent against the leaders
in the field. Moreover, contestants can exchange ideas and sharpen their skills in the Kaggle forums.
A good programmer can climb the rankings quickly by scoring well in two or three competitions.

- To some extent, Kaggle is a form of "crowdsourcing": it taps a global pool of minds to solve a large
problem. This way of drawing on the crowd has existed for a decade or more, at least since
Wikipedia (or earlier, since Linux, and so on). Companies such as TaskRabbit and oDesk have provided work
for the crowd for years. But Kaggle goes further. First, Kaggle participants do not work purely out of
goodwill: they want to win and to improve their ranking so as to have better opportunities on the job
market. Second, Kaggle does not just create jobs; it creates a new job market for experts. Unlike
traditional temporary workers, Kaggle members are stars.

- Kaggle rankings have become an important yardstick in the data-science community. Companies such as
American Express and the New York Times have begun listing a Kaggle ranking as a required credential
in their recruitment ads. It is not merely a badge but an indicator of ability, one that carries more
weight than traditional measures of qualification and expertise. A degree from a prestigious university
or a work history at a well-known company such as IBM may count for less than a Kaggle score. In other
words, measurable work and your ranking in the marketplace matter more than where you have worked.
Will the CV (curriculum vitae) eventually become unnecessary?
- Kaggle has created a new kind of labor market, one in which skill is separated from the unreliable
credentials of degrees and résumés. This is a major change.

Chapter 3

INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION

- Employers who post job ads often omit the salary. This puts job seekers in a dilemma: they risk
wasting valuable time applying for a low-paying job, or they skip the ad and risk missing a great
opportunity.
- Adzuna is a UK classified-ads company, most of whose listings are job ads. More than half of those
ads do not state a salary. To offer a better service, Adzuna wants to provide a salary estimate for
jobs where the employer has not listed one. To that end, Adzuna hosted a Kaggle competition aimed at
improving the prediction of job salaries.
- A successful model will combine some analysis of the effect of including different keywords or phrases
with the use of structured data fields such as location, working hours, or company. Some of the
structured data shown is inferred by Adzuna's own processes, based on where an ad came from or on its
contents, and may not be correct, but it is representative of the real data.

- Participants are given a training data set on which to build their model; it includes all variables,
including salary. A second data set is used to provide feedback on the public leaderboard. After about
6 weeks, Kaggle releases a final data set, without the salary field, to participants, who are then asked
to submit their salary predictions for each job for evaluation.

Chapter 4

JOB SALARY PREDICTION IN R

4.1 Reading the Data
Kaggle provides all of the data as .csv files, so we read it into R with read.csv.

To check the column names in the data, we use names().

To check how often each value occurs, we use table().

In the table() output we see:
1. full_time, part_time, contract, and permanent are values that occur in the training data.
2. The numbers are the frequencies with which each value occurs.
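
The report's original code for this step is not preserved, so the following is only a sketch of the three calls on a tiny inline sample. The column names (Id, ContractType, ContractTime, SourceName) are assumed to match the competition files, and the rows and source names are invented. With the real data, read.csv would be given the path to the downloaded .csv file instead of the text argument.

```r
# A made-up four-row sample standing in for the Kaggle training file
csv.text <- "Id,ContractType,ContractTime,SourceName
1,full_time,permanent,source_a
2,,contract,source_b
3,part_time,,source_a
4,full_time,permanent,
"

# Treat empty fields as missing values (NA)
train <- read.csv(text = csv.text, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)

# Column names of the data
col.names <- names(train)

# Frequency of each value; useNA = "ifany" also counts missing values
type.counts <- table(train$ContractType, useNA = "ifany")
time.counts <- table(train$ContractTime, useNA = "ifany")
```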

4.2 Building Top Sources


4.2.1 Training set
The summary() command gives an accurate and fairly complete overview of the data.
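
As a hedged illustration with invented values: summary() reports quartiles and the mean for a numeric vector, and level counts (including missing values) for a factor.

```r
# Made-up salaries and source labels
salary <- c(20000, 25000, 30000, 45000, 80000)
salary.summary <- summary(salary)   # Min, 1st Qu., Median, Mean, 3rd Qu., Max

sources <- factor(c("a", "b", "a", NA))
source.summary <- summary(sources)  # count per level, plus the number of NA's
```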


Select the 10 Sources with the highest frequency.

Assign the frequencies to Top Sources.

Add an Other category.

Move the frequency of the NA values into Other.
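
The steps above can be sketched as follows. This is not the report's original code, which is not preserved. It recodes the values directly rather than moving frequencies between table cells, and it keeps the top 2 values of a made-up vector where the report keeps the top 10.

```r
# Made-up source names; the report works on the source column of the training data
sources <- c("a", "a", "a", "b", "b", "c", "d", NA, NA)
n.top <- 2   # the report uses 10

# Frequencies sorted from most to least common (NA is not counted here)
source.counts <- sort(table(sources), decreasing = TRUE)
top.sources <- names(source.counts)[seq_len(min(n.top, length(source.counts)))]

# Keep the top values; every other value, including NA, becomes "Other"
top.source <- ifelse(!is.na(sources) & sources %in% top.sources, sources, "Other")
top.source <- factor(top.source, levels = c(top.sources, "Other"))

# Frequencies of the new Top Sources variable
top.counts <- table(top.source)
```

Recoding the column itself, instead of only the frequency table, has the advantage that the resulting factor can be passed straight to a model.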

4.2.2 Test set (same procedure as for the training set)


Assign the frequencies to Top Sources.


Add an Other category.

Move the frequency of the NA values into Other.
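
One possible way to repeat the procedure on the test set is to wrap the recoding in a helper and reuse the top values found on the training set, so that both sets share the same factor levels. The helper recode.top and the values below are hypothetical, and the report may instead have recomputed the top list on the test set.

```r
# Hypothetical helper: keep the given top levels, map everything else to "Other"
recode.top <- function(values, top.levels) {
  kept <- ifelse(!is.na(values) & values %in% top.levels, values, "Other")
  factor(kept, levels = c(top.levels, "Other"))
}

train.top.sources <- c("a", "b")        # assumed result of the training step
test.sources <- c("b", "z", NA, "a")    # made-up test values

test.top.source <- recode.top(test.sources, train.top.sources)
```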

4.3 Building Top Locations


4.3.1 Training set
Count the frequency of each location value and store the result in locations.counts.

Select the 10 locations with the highest frequency.

Assign the frequencies to the Top Locations.


Create the Other category.

Move the frequency of the NA values into Other.

Add the frequency of the UK value to Other.
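
A sketch of the location steps on invented data, again recoding values directly and using a top 3 instead of the report's top 10. Only the name locations.counts comes from the report. The value "UK" is folded into Other as the report describes, presumably because it does not identify a specific place.

```r
# Made-up location values
locations <- c("London", "London", "UK", "UK", "UK", "Leeds", "Leeds", "Bath", NA)
n.top <- 3   # the report uses 10

# Frequency of each location, most common first
locations.counts <- sort(table(locations), decreasing = TRUE)
top.locations <- names(locations.counts)[seq_len(min(n.top, length(locations.counts)))]

# Values outside the top list, and NA, become "Other"
top.location <- ifelse(!is.na(locations) & locations %in% top.locations,
                       locations, "Other")

# Fold the non-specific value "UK" into "Other" as well
top.location[top.location == "UK"] <- "Other"
top.location <- factor(top.location,
                       levels = c(setdiff(top.locations, "UK"), "Other"))

top.location.counts <- table(top.location)
```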

4.3.2 Test set


Assign the frequencies to the Top Locations.


Create the Other category.

Move the frequency of the NA values into Other.

Add the frequency of the UK value to Other.

4.4 Building the Model


Build a model of salary (SalaryNormalized) based on job category (Category), contract
duration (ContractTime), Top Location, and Top Sources.
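
The quantities discussed in this section (Residuals, Df, Sum Sq, F value) are those reported by summary() and anova() for a linear model, so a minimal sketch with lm() on simulated data is shown here. The data, the column names TopLocation and TopSource, and the simulated salary effect are all assumptions made for the example.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the prepared training data
train <- data.frame(
  Category     = factor(sample(c("IT", "Sales", "Teaching"), n, replace = TRUE)),
  ContractTime = factor(sample(c("permanent", "contract", "Other"), n, replace = TRUE)),
  TopLocation  = factor(sample(c("London", "Leeds", "Other"), n, replace = TRUE)),
  TopSource    = factor(sample(c("a", "b", "Other"), n, replace = TRUE))
)

# Made-up salaries: a category effect plus random noise
train$SalaryNormalized <- 30000 + 5000 * (train$Category == "IT") +
  rnorm(n, sd = 8000)

# Linear model of salary on the four categorical predictors
fit <- lm(SalaryNormalized ~ Category + ContractTime + TopLocation + TopSource,
          data = train)

fit.summary <- summary(fit)   # residuals, residual standard error, R-squared
fit.anova <- anova(fit)       # Df, Sum Sq, Mean Sq, F value, Pr(>F)
```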


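1. Residuals: the differences between the observed and the predicted values. We would like them to be
close to 0 so that the model is accurate, but they still range from Min to Max.

2. Residual Standard Error: computed as \(\sqrt{0.1627} \approx 0.40\), where 0.1627 is the residual mean
square and 244664 is the residual degrees of freedom.
3. Multiple R-squared: the model explains 32.69% of the total variation in the response.

\[ R^2 = \frac{4559 + 6788 + 7990}{4559 + 6788 + 7990 + 39810} = 0.3269 \]

4. Adjusted R-squared: R-squared corrected for the degrees of freedom used by the model. With the total
variance

\[ s^2 = \frac{4559 + 6788 + 7990 + 39810}{8 + 10 + 85 + 244664} = 0.2416 \]

the adjusted value is

\[ R^2_{\mathrm{adj}} = \frac{s^2 - 0.1627}{s^2} = 0.3266 \]

5. Df (degrees of freedom): the degrees of freedom of each term.

6. Sum Sq: the sum of squares.

7. Mean Sq: the mean square (Sum Sq divided by Df).
8. F value: the F statistic, computed as the mean square of the term divided by the residual mean square:

\[ F = \frac{569.91}{0.1627} \approx 3502.58 \]

9. Pr(>F): the p-value of the F test.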
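
The figures in this list can be recomputed from the rounded sums of squares and degrees of freedom quoted in the formulas. The variable names are our own. Because the inputs are rounded, the results agree with the reported values only to about three or four significant digits.

```r
ss.terms <- c(4559, 6788, 7990)   # Sum Sq of the model terms
ss.resid <- 39810                 # residual Sum Sq
df.terms <- c(8, 10, 85)          # Df of the model terms
df.resid <- 244664                # residual Df

ms.resid <- ss.resid / df.resid                  # residual mean square, about 0.1627
rse <- sqrt(ms.resid)                            # residual standard error, about 0.40

r2 <- sum(ss.terms) / (sum(ss.terms) + ss.resid) # multiple R-squared

s2.total <- (sum(ss.terms) + ss.resid) / (sum(df.terms) + df.resid)
r2.adj <- (s2.total - ms.resid) / s2.total       # adjusted R-squared

# F value of the first term: its mean square over the residual mean square
f.first <- (ss.terms[1] / df.terms[1]) / ms.resid
```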

Run the salary prediction step.

Create the output.

Write the output to a file.
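
A minimal sketch of these three steps, assuming a fitted lm model and an output with the columns Id and SalaryNormalized. The tiny data frames, the single predictor, and the temporary output file are made up for the example and do not reproduce the report's code.

```r
# Tiny made-up training and test sets with one predictor
train <- data.frame(
  Category = factor(c("IT", "IT", "Sales", "Sales", "IT", "Sales")),
  SalaryNormalized = c(40000, 42000, 30000, 28000, 41000, 29000)
)
test <- data.frame(
  Id = c(101, 102),
  Category = factor(c("Sales", "IT"), levels = levels(train$Category))
)

fit <- lm(SalaryNormalized ~ Category, data = train)

# Step 1: predict salaries for the test set
predicted <- predict(fit, newdata = test)

# Step 2: create the output table
output <- data.frame(Id = test$Id, SalaryNormalized = predicted)

# Step 3: write the output to a .csv file (a temporary file in this sketch)
out.file <- tempfile(fileext = ".csv")
write.csv(output, out.file, row.names = FALSE)
```

With a single categorical predictor the predictions are simply the group means of the training salaries, which makes the result easy to check by hand.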


THE END.


