Tackling machine learning with the whole workflow in mind
Last time, we looked at supervised learning for classification and worked through how two methods, decision trees and logistic regression, differ. Did you get an intuitive picture of what those differences are?
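So far, to build an intuitive understanding of machine learning, we have prepared simple data and dealt only with the machine learning step itself. In practice, though, you can run machine learning only after 1) defining the purpose of using machine learning and the problem to solve, 2) designing the data, 3) collecting and processing the data, and 4) getting to know the data; only then comes 5) running machine learning. This time we have prepared more complex data than before, so let's work through machine learning while keeping the whole workflow (steps 1 to 5) in mind.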
Preparing the data and getting an overview
This time, we will predict (classify) whether consumer expenditure will increase, based on the consumer price index. As before, the data comes from e-Stat, the portal site for official Japanese government statistics.
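For the target variable, we took monthly consumer expenditure from the Family Income and Expenditure Survey and assigned a flag of 1 when expenditure rose from the previous month and 0 when it fell. For the explanatory variables, we took the monthly consumer price index from the 2015-base Consumer Price Index and processed it. We will not describe the data in much detail up front this time. Let's work out together what kind of data it is.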
First, download the data (ConsumerExpenditures.csv) and copy it into your Jupyter Notebook working folder. Start Jupyter Notebook, open a notebook from New at the upper right, and rename it. We named ours ConsumerExpenditures.
To begin, let's load the data.
import pandas as pd
data = pd.read_csv('ConsumerExpenditures.csv')
data.head()
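We start by displaying only the first five rows. Unlike before, not all of the data fits on screen, and part of it is abbreviated with "…". The "63 columns" note beneath the table tells us there are 63 columns. Previous datasets had about 10 columns, so the number of columns, and with it the number of explanatory variables, has grown considerably. As it stands, we cannot fully see how many records there are or which columns exist, so let's first get an overview. Run the code below.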
data.count()
Calling count displays each column name together with the number of values in that column. Every column shows 387, so this dataset has 387 rows. The row count has also grown substantially from the roughly 50 rows we used before. This dataset has none, but real data often has missing values (null) where nothing was entered. Because count excludes missing values, it also gives a quick way to check whether any values are missing.
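The missing-value check described above can be sketched on a small table. This is only an illustration: the toy DataFrame and its column names are made up and are not part of the article's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (not the article's data): column "b" has one missing value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

n_rows = len(toy)                # total number of rows, missing or not
non_null = toy.count()           # per-column count of non-missing values
n_missing = toy.isnull().sum()   # per-column count of missing values

print(n_rows)
print(non_null)
print(n_missing)

# Columns whose count falls short of the row count contain missing values.
cols_with_missing = non_null[non_null < n_rows].index.tolist()
print(cols_with_missing)
```

Comparing count against the row count works as a quick check; isnull().sum() reports the number of missing values per column directly.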
count has already shown us most of the column names, but while we are at it, let's extract just the column names.
data.columns
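This gives us all of the column names. Looking through them, the columns divide broadly into this exercise's target variable, 消費支出増加フラグ (the consumer expenditure increase flag), and the explanatory variables that follow it. The explanatory variables are the same consumer price index items we handled in the third installment. The difference is that, for each month (Date), the consumer price index is provided for each of the preceding months, from 1 month back to 6 months back. The reason for preparing the data this way goes back to step 1), the purpose of machine learning, mentioned at the start.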
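As we have said several times, the real purpose of machine learning is to predict unseen cases. In this example, that means predicting next month's consumer expenditure. Can we use next month's data when predicting next month? As you will have guessed, next month's data does not exist yet, so the prediction has to be made from data available up to the current month. That is why the consumer price index values are taken from past months. It is easy to imagine that when prices rise, people become reluctant to spend. We do not do so here, but strictly speaking it would be better still to check, for example with graphs, whether past consumer price indices actually contribute to expenditure. Since we do not know how many months back prices influence expenditure, we extracted the data for each of 1 to 6 months back, designed the data into a form suitable for machine learning (step 2, data design), and processed it (step 3, data collection and processing).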
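As a rough sketch of this kind of data design, the lagged columns and the increase flag could be built with pandas as follows. The series, its values, and the column names (cpi_food, spending, spending_up) are hypothetical; the article does not show how its file was actually produced.

```python
import pandas as pd

# Hypothetical monthly series; names and values are made up for illustration.
raw = pd.DataFrame({
    "Date": pd.period_range("2020-01", periods=10, freq="M"),
    "cpi_food": [100.0, 100.5, 101.0, 100.8, 101.2, 101.5, 101.4, 102.0, 102.3, 102.1],
    "spending": [300, 310, 305, 320, 318, 325, 330, 328, 335, 340],
})

# Target: 1 if spending rose versus the previous month, else 0.
# (The first row has no previous month; it is dropped further down.)
raw["spending_up"] = (raw["spending"].diff() > 0).astype(int)

# Explanatory variables: the index 1 to 6 months before each row's month.
for lag in range(1, 7):
    raw[f"cpi_food_lag{lag}"] = raw["cpi_food"].shift(lag)

# The first 6 rows lack a full 6-month history, so drop them.
lagged = raw.dropna().reset_index(drop=True)
print(lagged.shape)
print(lagged[["Date", "spending_up", "cpi_food_lag1", "cpi_food_lag6"]])
```

Each remaining row then holds only information that was already available before its own month, which is what a forecast for the coming month requires.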
Preparing the data
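Now let's prepare the data for machine learning. As you saw when exploring it, the target variable is 消費支出増加フラグ (the consumer expenditure increase flag), and the explanatory variables are the price indices for each category, from food through miscellaneous, at 1 to 6 months back.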
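# Work on a copy so the loaded data stays intact
data_tmp = data.copy()
# Drop columns that are neither the target nor explanatory variables
del data_tmp['Date']
del data_tmp['消費支出']
# Explanatory variables: everything except the target flag
data_tmp_X = data_tmp.copy()
del data_tmp_X['消費支出増加フラグ']
# Target variable: the expenditure increase flag
data_tmp_Y = data_tmp['消費支出増加フラグ']
data_tmp_X.columns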
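First, we create data_tmp so that the loaded data is left untouched. From it we delete the unneeded columns and then create the explanatory variables data_tmp_X and the target variable data_tmp_Y. Finally, data_tmp_X.columns confirms that the columns are set up as intended. Next, we split the data into training data and test data.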
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(data_tmp_X, data_tmp_Y, random_state=0, test_size=0.3)
print(len(X_train))
print(len(Y_train))
print(len(X_test))
print(len(Y_test))
We use train_test_split for the split, with 70% for training and 30% for testing. To confirm that the split worked, we print the number of records in each set with len. X_train and Y_train have 270 records and X_test and Y_test have 117, which together account for all 387 records, so the data was split as specified. The data is now ready for machine learning.
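The split sizes can be checked by hand. The sketch below assumes that scikit-learn rounds the test share up and gives the remainder to training when only test_size is specified; the dummy arrays merely stand in for the real table.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 387
test_size = 0.3

# Assumed rounding rule: the test share is rounded up, the rest is training data.
expected_test = math.ceil(n_samples * test_size)
expected_train = n_samples - expected_test

# Dummy rows standing in for the real table (an assumption for illustration).
dummy_X = np.arange(n_samples).reshape(-1, 1)
dummy_y = np.arange(n_samples) % 2
X_tr, X_te, y_tr, y_te = train_test_split(
    dummy_X, dummy_y, random_state=0, test_size=test_size
)

print(expected_train, expected_test)
print(len(X_tr), len(X_te))
```

Since 387 × 0.3 = 116.1, the test set gets 117 rows and the training set the remaining 270.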
Decision tree
Let's start by building a prediction model with a decision tree.
from sklearn.tree import DecisionTreeClassifier
treeModel = DecisionTreeClassifier(max_depth=3, random_state=0)
treeModel.fit(X_train, Y_train)
from sklearn import metrics
print(metrics.accuracy_score(Y_train, treeModel.predict(X_train)))
print(metrics.accuracy_score(Y_test, treeModel.predict(X_test)))
The first line imports the decision tree, and the second defines the prediction model. For now we set the tree depth to 3. We then build the model with fit and, to gauge how well it performs, run the simplest form of model evaluation, accuracy, with accuracy_score. Accuracy was 78% on the training data and 60% on the test data. Because accuracy is noticeably higher on the training data, the model tends toward overfitting the training data, but it does reach at least 60% on both the training and test data.
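For reference, accuracy is simply the share of predictions that match the true labels. The small sketch below, with made-up labels and predictions, computes it by hand and compares the result with accuracy_score.

```python
from sklearn import metrics

# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Accuracy is the number of matching predictions divided by the total.
n_correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = n_correct / len(y_true)

library_accuracy = metrics.accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```

Here 8 of the 10 predictions are correct, so both values come out to 0.8.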
Since the model tends to overfit, let's reduce its complexity. One way to simplify the model is to lower max_depth.
treeModel = DecisionTreeClassifier(max_depth=2, random_state=0)
treeModel.fit(X_train, Y_train)
print(metrics.accuracy_score(Y_train, treeModel.predict(X_train)))
print(metrics.accuracy_score(Y_test, treeModel.predict(X_test)))
With max_depth set to 2, accuracy on the training data fell and accuracy on the test data rose. Training accuracy went down, but the model generalizes better, so we can consider it a better model. Finally, let's extract the variables that contribute to the model.
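# Column labels: '変数名' = variable name, '重要度' = importance
Importance = pd.DataFrame({'変数名':data_tmp_X.columns, '重要度':treeModel.feature_importances_})
# Keep only the variables the tree actually used
Importance[Importance['重要度'] != 0]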
Judging from this alone, the only contributing variables are clothing-related prices. Interpretation is difficult, but it is hard to believe that clothing alone determines all consumer expenditure, and we also need to bear in mind that the model's accuracy is still low. With this dataset, however, it is difficult to push the decision tree's accuracy any higher, so let's try logistic regression and compare the accuracy.
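The effect of max_depth on training and test accuracy can also be examined by trying several depths in a loop. The sketch below uses synthetic data from make_classification as a stand-in, so its numbers will not match the ones in this article; the depths tried are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the consumer expenditure table).
X_demo, y_demo = make_classification(
    n_samples=387, n_features=60, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0, test_size=0.3
)

depth_scores = {}
for depth in [1, 2, 3, 5, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(Xd_train, yd_train)
    depth_scores[depth] = (
        model.score(Xd_train, yd_train),  # training accuracy
        model.score(Xd_test, yd_test),    # test accuracy
    )

for depth, (train_acc, test_acc) in depth_scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

A widening gap between the two columns as the depth grows is the usual sign of overfitting.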
Logistic regression
The data can be exactly the same as for the decision tree, so we reuse the training and test sets we already split.
from sklearn.linear_model import LogisticRegression
logModel = LogisticRegression()
logModel.fit(X_train, Y_train)
print(metrics.accuracy_score(Y_test, logModel.predict(X_test)))
print(metrics.accuracy_score(Y_train, logModel.predict(X_train)))
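As with the decision tree, the first line imports logistic regression and the second defines the prediction model. Unlike the decision tree, we specify no parameters and use the defaults. We then build the model with fit and run the same simple accuracy evaluation; note that this code prints the test score first and the training score second. Surprisingly, accuracy was 83% on the training data and 79% on the test data. Not only is overall accuracy higher than the decision tree's, but the gap between training and test accuracy is small, so this is a good model that generalizes well.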
As mentioned last time, each method classifies in a different way, so model accuracy can differ widely, as it did here. This should show why it matters to try several methods, evaluate them properly, and build the most suitable model.
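While we are here, let's look a little more closely at the logistic regression model. Last time we explained that logistic regression does not draw boundaries through conditional branches the way a decision tree does, but draws a dividing line based on an expression in several variables. Logistic regression takes the linear regression function Y = aX1 + bX2 + z and passes it through a function that squeezes its value into the range 0 to 1. So, as with linear regression, building the model means drawing a line, and drawing a line amounts to determining the coefficients (a, b, z).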
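The idea of squeezing a linear function into the range 0 to 1 can be sketched in a few lines. The function used for this is the sigmoid (logistic) function; the coefficient values a, b, and z below are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(t):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical coefficients and intercept, chosen only for illustration.
a, b, z = 0.8, -0.5, 0.1


def predict_proba(x1, x2):
    """Probability of class 1 for the inputs x1 and x2."""
    linear = a * x1 + b * x2 + z  # the linear part, Y = a*X1 + b*X2 + z
    return sigmoid(linear)


def predict_label(x1, x2):
    """Class 1 when the probability is at least 0.5, i.e. the linear part is >= 0."""
    return 1 if predict_proba(x1, x2) >= 0.5 else 0


print(sigmoid(0.0))
print(predict_proba(2.0, 1.0))
print(predict_label(2.0, 1.0), predict_label(-2.0, 1.0))
```

The dividing line is where the linear part equals 0: there the sigmoid returns exactly 0.5, and points on either side are assigned to different classes.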
Logistic regression has no feature_importances_ like the decision tree, but the coefficients give some indication of each variable's contribution. Let's output them.
Coef = pd.DataFrame({'変数名':data_tmp_X.columns, '係数':logModel.coef_[0]})  # '係数' = coefficient
Coef.sort_values('係数')
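The first line extracts the coefficients from the model we just built. These coefficients correspond to a and b in Y = aX1 + bX2 + z. In other words, each one multiplies its variable: the larger the absolute value, the larger the contribution, with positive values contributing to an increase in expenditure and negative values to a decrease. No 1-month-back price index appears among the coefficients with large absolute values, which suggests that the 1-month-back indices contribute little. We can therefore expect that changes in expenditure are influenced mainly by prices from at least 2 months back. That said, the housing price index appears with both positive and negative coefficients, so factors such as business cycles may also be involved. Improving the model further would require revisiting the granularity of the data and the choice of explanatory variables.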
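Since the size of the contribution is read from the absolute value, it can help to sort by absolute value rather than by the signed coefficient. The sketch below does this on synthetic data; the column names are made up, and it assumes the variables are on comparable scales, because otherwise coefficient sizes are not directly comparable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with made-up column names (assumptions for illustration).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_demo, y_demo)

coef_table = pd.DataFrame({"name": names, "coef": demo_model.coef_[0]})
coef_table["abs_coef"] = coef_table["coef"].abs()

# Largest magnitude first; the sign of "coef" still shows the direction.
ranked = coef_table.sort_values("abs_coef", ascending=False).reset_index(drop=True)
print(ranked)
```

Variables at the top of the table contribute most in this sense, and the sign of coef shows whether they push toward class 1 or class 0.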
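This time, we built on what we have learned so far and, in a setting closer to real practice, worked with the data all the way through to building a model with machine learning. Along with reviewing the earlier installments, did you gain a better feel for handling data and for the model-building process? From here on, I think the most important thing is to think, sometimes struggling, about the problems you want to solve and to take on many kinds of data. Working on a variety of problems should help you build machine learning models that the world actually needs, without fixating on accuracy alone. And by framing problems and working out solutions yourself, your repertoire of machine learning methods will grow naturally.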
Next time, I hope to work with a different kind of data and also touch on the future of machine learning.
Author profile
下山輝昌
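After working in hardware research and development at a major electronics manufacturer, he went independent. Since then he has gained hands-on experience in software, data analysis, and related fields, and has co-founded several companies. At one of them, 合同会社アイキュベータ, he researches the potential and direction of artificial intelligence, IoT, and related technologies. Recently he has focused on open data, launching web services for putting open data to use and pursuing value creation through open data × IoT as one of his themes.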









