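These days, novels from every era and every part of the world are published online, and once you start reading them there is never enough time. In this installment, we use a simple negative/positive judgment technique to analyze a novel before reading it, as a way to get a feel for the kind of novels you like. We use Colaboratory, a Python execution environment that runs in the browser with no setup, so feel free to try out the basics of morphological analysis and natural language processing.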
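Are novels now all-you-can-read?
This is a wonderful era for novel lovers. Most works by the literary masters of the Meiji era and earlier are out of copyright, so they can be read freely on Aozora Bunko (青空文庫), and the online novel posting site 小説を読もう！ offers more than 700,000 titles to read for free. The author also likes novels and reads them from time to time, but there are so many kinds that it is hard to know which one to pick. So this time, let's use a negative/positive judgment technique to run a simple analysis of a novel and put numbers on the tendencies of the novels we like.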
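What is negative/positive judgment?
Negative/positive judgment is one kind of the technique known as sentiment analysis (感情分析). It extracts and analyzes the words in a text that relate to emotional expression, such as 嬉しい (happy) or 悲しい (sad). The basic method is not difficult: for example, a text containing the word 嬉しい is judged to be positive, and a text containing the word 悲しい is judged to be negative.
It is already used in a variety of fields, for example to judge the polarity of tweets on Twitter about a particular incident, or to judge product reviews on Amazon. A typical application is Yahoo!'s real-time search, which has an interesting feature that displays the negative/positive judgment for a search term.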
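How to build a negative/positive judgment program
Let's get started and build one. First, let's check the steps of the program we are going to write. As a list, they are as follows.
- (1) Run morphological analysis on the target text to split it into its smallest units, called morphemes
- (2) Check whether each morpheme matches an entry in the negative/positive dictionary, and count it if it does
- (3) Calculate from the counts and display the result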
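The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_dic` and `tokens` are made-up stand-ins for the real polarity dictionary and for the output of a morphological analyzer, and the helper names `count_polarity` and `ratios` are hypothetical.

```python
# A minimal sketch of the three steps, using made-up data.
# toy_dic and tokens are hypothetical stand-ins for the real polarity
# dictionary and for the output of a morphological analyzer.
toy_dic = {"嬉しい": "p", "悲しい": "n", "今日": "e"}

# Step 1 (assumed already done): the text split into morphemes
tokens = ["今日", "は", "嬉しい", "けれど", "少し", "悲しい", "嬉しい"]


def count_polarity(tokens, dic):
    """Step 2: count the tokens that match a dictionary entry."""
    counts = {"p": 0, "n": 0, "e": 0}
    for tok in tokens:
        label = dic.get(tok)  # None when the token is not listed
        if label in counts:
            counts[label] += 1
    return counts


def ratios(counts):
    """Step 3: turn the counts into positive and negative ratios."""
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0  # nothing matched, so avoid dividing by zero
    return counts["p"] / total, counts["n"] / total


counts = count_polarity(tokens, toy_dic)
pos, neg = ratios(counts)
print(counts)
print("positive:", pos, "negative:", neg)
```

Words that are not in the dictionary (particles such as は, for example) are simply ignored, so the ratios are relative to the words that could be judged, not to the length of the text.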
To analyze the novel, we use Google's Colaboratory, which lets you run Python in the browser. On top of that, the libraries commonly used for machine learning are already installed, so there is no extra setup work.
Open Colaboratory in a web browser and log in with your Google account.
Then, from the [File] menu at the top of the screen, click "New Python 3 notebook" to create a notebook. Once the notebook opens, install Janome, a library for morphological analysis, by writing the following command in a cell and running it. (To run a cell, click the run button to the left of the code you wrote.)
!pip install janome
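Morphological analysis with Janome is also covered in the 18th installment of this series, 「夏目漱石が最も使った言葉は何？ - 文章中の単語をカウントしよう」 (What word did Natsume Soseki use most? Let's count the words in a text), so refer to that as well.
Preparing the negative/positive dictionary
Next, download the 日本語評価極性辞書 (Japanese Sentiment Polarity Dictionary), the word dictionary used to judge whether a word is positive or negative. The dictionary is published online, and as long as its URL has not changed, entering the following command downloads the data.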
! curl http://www.cl.ecei.tohoku.ac.jp/resources/sent_lex/pn.csv.m3.120408.trim > pn.csv
The command above uses curl to download the negative/positive dictionary and save it to a file named pn.csv.
Downloading the novel
Next, download the novel to analyze. Since we have online novels in mind, we assume the novel is in HTML format. Here we analyze Dazai Osamu's 走れメロス (Run, Melos!) from Aozora Bunko. Again, run a command to download it. The downloaded file is saved under the name syosetu.html.
!curl https://www.aozora.gr.jp/cards/000035/files/1567_14913.html > syosetu.html
Once the novel is downloaded, remove the HTML tags. Even though the page looks like plain text on screen, choosing "View page source" from the browser's context menu shows that many HTML tags are embedded in it. So let's strip the tags and keep only the text.
A good way to remove HTML tags is the BeautifulSoup4 library. It comes preinstalled in Colaboratory, so no additional installation is needed. (If it is not installed, you can install it with "! pip install beautifulsoup4".)
Running the following code removes the HTML tags and saves the result to a file.
from bs4 import BeautifulSoup
# Read the file (Aozora Bunko HTML is encoded in Shift_JIS)
with open("syosetu.html", "rt", encoding="sjis") as f:
    html = f.read()
# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')
# Remove every ruby (furigana) annotation, not only the first one
for tag in soup.find_all(["rp", "rt"]):
    tag.extract()
# Extract only the text
text = soup.get_text()
print(text)
# Save
with open("syosetu.txt", "wt", encoding="utf-8") as w:
    w.write(text)
When run, the extracted text is printed, and the result is also saved to the file syosetu.txt.
Incidentally, the HTML on Aozora Bunko is encoded in Shift_JIS rather than the more common UTF-8. The HTML on 小説を読もう！ is UTF-8, so change the character encoding argument accordingly when reading files from that site. Also, downloading a large number of files in rapid succession from a program puts load on the server, so be careful.
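Why the encoding argument has to match the source can be checked with a short, self-contained experiment. The sketch below does not touch any downloaded file. It only encodes a sample string (chosen here for illustration) both ways and shows that Shift_JIS bytes cannot be read back as UTF-8.

```python
# A minimal sketch: the same text becomes different bytes in Shift_JIS and
# UTF-8, so the encoding given to open() or decode() must match the source.
sample = "走れメロス"  # illustrative sample string

sjis_bytes = sample.encode("shift_jis")
utf8_bytes = sample.encode("utf-8")
print(len(sjis_bytes), len(utf8_bytes))  # byte lengths differ

# Decoding with the matching encoding restores the original text
restored = sjis_bytes.decode("shift_jis")
print(restored == sample)

# Decoding Shift_JIS bytes as UTF-8 raises an error
try:
    sjis_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print("decoded as UTF-8:", decoded_ok)
```

If reading a file fails with a `UnicodeDecodeError`, the `encoding` argument passed to `open()` is the first thing to check.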
Loading the negative/positive dictionary
Next, load the 日本語評価極性辞書 downloaded earlier (file name: pn.csv) into a Python dictionary. Here we read the CSV file into a dict variable named np_dic.
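# Load the negative/positive dictionary
import csv
np_dic = {}
with open("pn.csv", "rt", encoding="utf-8") as fp:
    # Tab-separated: column 0 is the word, column 1 is its label
    reader = csv.reader(fp, delimiter='\t')
    for i, row in enumerate(reader):
        if len(row) < 2: continue  # skip blank or malformed lines
        name = row[0]
        result = row[1]
        np_dic[name] = result
        if i % 500 == 0: print(i)  # progress indicator
print("ok")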
If the program prints "ok", write code like the following to confirm that the dictionary was loaded.
print(np_dic["笑顔"])
print(np_dic["嫌い"])
print(np_dic["時間"])
When run, each line prints a single label: "p" means positive, "n" means negative, and "e" means a neutral word that is neither.
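Looking up a word with `np_dic[word]` raises a `KeyError` when the word is not in the dictionary, which is why a membership check or `dict.get` is useful. The following sketch illustrates this with made-up rows in the same tab-separated layout that the loader assumes (word in column 0, label in column 1). The rows, `toy_dic`, and the `lookup` helper are hypothetical and are not taken from the real dictionary file.

```python
import csv
import io

# Made-up rows in the assumed layout: word <TAB> label <TAB> other columns.
# They are illustrative only and are NOT copied from the real dictionary.
sample_tsv = "嬉しい\tp\tdummy\n\n悲しい\tn\tdummy\n"

toy_dic = {}
for row in csv.reader(io.StringIO(sample_tsv), delimiter="\t"):
    if len(row) < 2:
        continue  # a blank line yields an empty row, so skip it
    toy_dic[row[0]] = row[1]


def lookup(word, dic):
    """Return the polarity label, or None if the word is not listed."""
    return dic.get(word)


print(lookup("嬉しい", toy_dic))
print(lookup("存在しない単語", toy_dic))  # not listed, so None is returned
```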
Quantifying polarity with morphological analysis
Now let's run the morphological analysis and make the negative/positive judgment. Write the following program in Colaboratory and run it. Note, however, that all of the steps above must have been run first.
# Read the novel
with open("syosetu.txt", "rt", encoding="utf-8") as fp:
    text = fp.read()
# Morphological analysis
from janome.tokenizer import Tokenizer
tok = Tokenizer()
# Count the labels
res = {"p":0, "n":0, "e":0}
for t in tok.tokenize(text):
    bf = t.base_form # base (dictionary) form
    # Check whether the word is in the dictionary
    if bf in np_dic:
        r = np_dic[bf]
        if r in res:
            res[r] += 1
# Show the result
print(res)
cnt = res["p"] + res["n"] + res["e"]
print("Positive ratio", res["p"] / cnt)
print("Negative ratio", res["n"] / cnt)
When run, the counts and the two ratios are printed. For 走れメロス, the positive ratio came out at 0.29 and the negative ratio at 0.20, so it turned out to be a slightly positive novel.
Trying other novels
Trying several novels gave the following results.
Dazai Osamu, 走れメロス (Run, Melos!):
{'p': 118, 'n': 83, 'e': 208}
Positive ratio 0.2885085574572127
Negative ratio 0.20293398533007334
Natsume Soseki, 吾輩は猫である (I Am a Cat):
{'p': 3324, 'n': 3536, 'e': 7691}
Positive ratio 0.22843790804755687
Negative ratio 0.24300735344649851
Akutagawa Ryunosuke, 羅生門 (Rashomon):
{'p': 49, 'n': 76, 'e': 133}
Positive ratio 0.18992248062015504
Negative ratio 0.29457364341085274
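To compare the three novels at a glance, the counts listed above can be reduced to a single number. The sketch below uses the positive ratio minus the negative ratio. This "balance" score and the `polarity_balance` helper are an illustrative assumption, not something the program above computes, and the English titles are used as dictionary keys only for readability.

```python
# Counts copied from the results listed above
results = {
    "Run, Melos!": {"p": 118, "n": 83, "e": 208},
    "I Am a Cat": {"p": 3324, "n": 3536, "e": 7691},
    "Rashomon": {"p": 49, "n": 76, "e": 133},
}


def polarity_balance(counts):
    """Positive ratio minus negative ratio (a hypothetical summary score)."""
    total = counts["p"] + counts["n"] + counts["e"]
    if total == 0:
        return 0.0  # no judged words, so avoid dividing by zero
    return (counts["p"] - counts["n"]) / total


# Sort the titles from most positive to most negative
ranked = sorted(results, key=lambda title: polarity_balance(results[title]),
                reverse=True)
for title in ranked:
    print(f"{title}: {polarity_balance(results[title]):+.3f}")
```

With these counts, 走れメロス comes first with a score of about +0.086, 吾輩は猫である is close to zero, and 羅生門 is the most negative of the three.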
I also drew a chart with code like the following. It is nice that four lines of code are enough to produce a chart.
import pandas as pd
df = pd.DataFrame({'nega_posi':[res['p'], res['n'], res['e']]},
                  index=['Positive','Negative','Neutral'])
df.plot.pie(y='nega_posi', figsize=(6,6))
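Summary
Judging the polarity of a novel and seeing it as numbers like this is fresh and fun. In this program, the positive ratio is calculated as "number of positive words ÷ total number of words that could be judged". It is a simple calculation, but the value shifts considerably with the mood of the story. For something so simple, I think it judges reasonably well. It can be applied to texts other than novels too, so give it a try.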
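Freestyle programmer. Works at Kujirahand (くじらはんど) to convey the joy of programming. Representative works include the Japanese programming language なでしこ (Nadesiko) and the text music software サクラ (Sakura). Prize winner at the 2001 Online Software Grand Prize, certified as a Super Creator by the Mitou Youth program in fiscal 2004, and recipient of the OSS Contributor Award in 2010. Has also written many technical books.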






