
Word2vec Algorithm Review 3: Hands-On Practice with Naver Movie Review Data

by 시몬쯔 2020. 5. 30.

 

Source of the Naver movie review dataset (NSMC): https://github.com/e9t/nsmc/

 

The previous post covered the rough theory behind Word2Vec, so this time let's put it into practice.

 

 

 

In [ ]:
# import matplotlib as mpl
# import matplotlib.pyplot as plt

# %config InlineBackend.figure_format = 'retina'

# !apt -qq -y install fonts-nanum

# import matplotlib.font_manager as fm
# fontpath = '/usr/share/fonts/truetype/nanum/NanumBarunGothic.ttf'
# plt.rc('font', family='NanumBarunGothic')
# mpl.font_manager._rebuild()
In [70]:
# Render plots inline in the notebook
%matplotlib inline

# Import the required packages and libraries
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

# Keep the minus sign from rendering as a broken glyph in plots
mpl.rcParams['axes.unicode_minus'] = False
In [73]:
font_list = fm.findSystemFonts(fontpaths=None, fontext='ttf')

# Total number of TTF fonts found on the system
print(len(font_list)) 
 
834
In [74]:
# Show only the first 10 entries of the system font list
font_list[:10] 
Out[74]:
['C:\\Windows\\Fonts\\DUBAI-REGULAR.TTF',
 'C:\\WINDOWS\\Fonts\\corbell.ttf',
 'C:\\WINDOWS\\Fonts\\LTYPEO.TTF',
 'C:\\WINDOWS\\Fonts\\OUTLOOK.TTF',
 'C:\\WINDOWS\\Fonts\\STXINWEI.TTF',
 'C:\\Windows\\Fonts\\PAPYRUS.TTF',
 'C:\\Windows\\Fonts\\pala.ttf',
 'C:\\WINDOWS\\Fonts\\HANDotumExt.ttf',
 'C:\\Windows\\Fonts\\HANBatangB.ttf',
 'C:\\Windows\\Fonts\\BOD_BLAR.TTF']
In [75]:
[(f.name, f.fname) for f in fm.fontManager.ttflist if 'Nanum' in f.name]
Out[75]:
[('NanumGothic', 'C:\\Windows\\Fonts\\\x7f\x7f\x7f\x7fBOLD.TTF'),
 ('NanumGothic', 'C:\\Windows\\Fonts\\\x7f\x7f\x7f\x7f.TTF'),
 ('NanumGothic', 'C:\\WINDOWS\\Fonts\\\x7f\x7f\x7f\x7fEXTRABOLD.TTF')]
In [76]:
path = 'NanumGothic.ttf'
font_name = fm.FontProperties(fname=path, size=50).get_name()
print(font_name)
plt.rc('font', family=font_name)
 
NanumGothic
In [77]:
import matplotlib.pyplot as plt
In [79]:
import pandas as pd
In [80]:
import urllib.request
%matplotlib inline
import matplotlib.pyplot as plt
import re
from konlpy.tag import Okt
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
In [81]:
train_data = pd.read_table('./nsmc/ratings_train.txt')
test_data = pd.read_table('./nsmc/ratings_test.txt')
In [82]:
train_data.head()
Out[82]:
  id document label
0 9976970 아 더빙.. 진짜 짜증나네요 목소리 0
1 3819312 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 1
2 10265843 너무재밓었다그래서보는것을추천한다 0
3 9045019 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 0
4 6483659 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... 1
In [83]:
test_data.head()
Out[83]:
  id document label
0 6270596 굳 ㅋ 1
1 9274899 GDNTOPCLASSINTHECLUB 0
2 8544678 뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아 0
3 6825595 지루하지는 않은데 완전 막장임... 돈주고 보기에는.... 0
4 6723715 3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠?? 0
In [84]:
print(len(train_data))
 
150000
 

print(len(test_data))

In [85]:
train_data['document'].nunique(), train_data['label'].nunique() 
# Counting unique values: only 146,182 of the 150,000 documents are distinct.
# label is either 0 or 1, so it has 2 unique values.
Out[85]:
(146182, 2)
In [86]:
train_data.drop_duplicates(subset=['document'], inplace=True)  # drop rows whose document text duplicates an earlier row
In [87]:
print(len(train_data))
 
146183
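
The length after `drop_duplicates` (146183) is one more than the `nunique()` count (146182). A likely explanation, consistent with the single NaN row found further down, is that `nunique()` skips missing values by default while `drop_duplicates` keeps one row whose `document` is NaN. A minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from NSMC):

```python
import pandas as pd

# Hypothetical toy data: one duplicated review and two missing reviews.
toy = pd.DataFrame({
    "document": ["good", "good", None, "bad", None],
    "label": [1, 1, 0, 0, 1],
})

# nunique() ignores missing values by default (dropna=True).
n_unique = toy["document"].nunique()

# drop_duplicates() treats missing values as equal to each other and keeps the first one.
deduped = toy.drop_duplicates(subset=["document"])

print(n_unique)      # distinct non-missing documents
print(len(deduped))  # distinct documents plus one row with a missing document
```

Dropping the remaining NaN row afterwards brings the two numbers back in line.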
In [88]:
train_data['label'].value_counts().plot(kind = 'bar')
Out[88]:
<matplotlib.axes._subplots.AxesSubplot at 0x22209262a90>
 
In [89]:
print(train_data.groupby('label').size().reset_index(name = 'count'))
 
   label  count
0      0  73342
1      1  72841
In [90]:
print(train_data.isnull().values.any())
 
True
In [91]:
print(train_data.isnull().sum())
 
id          0
document    1
label       0
dtype: int64
In [92]:
train_data.loc[train_data.document.isnull()]
Out[92]:
  id document label
25857 2172111 NaN 1
In [93]:
train_data = train_data.dropna(how = 'any')
print(train_data.isnull().values.any())
 
False
In [94]:
print(len(train_data))
 
146182
 

Let's remove every character other than Hangul and spaces.

In [95]:
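# Keep only Hangul (jamo ㄱ-ㅎ and ㅏ-ㅣ, syllables 가-힣) and spaces.
# The sample outputs in this post were produced with the class written as "[^ㄱ-하-ㅣ가힣]"
# (replacement " "), where ㄱ-하 acts as one range and syllables such as 화, 한, 히 are lost.
train_data['document'] = train_data['document'].str.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "", regex=True)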
In [96]:
train_data.head()
Out[96]:
  id document label
0 9976970 아 더빙 진짜 짜증나네요 목소리 0
1 3819312 포스터보고 초딩영 줄 오버연기조차 가볍지 않구나 1
2 10265843 너무재밓었다그래서보는것을추천 다 0
3 9045019 교도소 이야기구먼 솔직 재미는 없다 평점 조정 0
4 6483659 사이몬페그의 익살스런 연기가 돋보였던 영 스파이더맨에서 늙어보이기만 던 커스틴 ... 1
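
The sample output above has gaps where syllables such as 화 and 한 used to be (초딩영 줄, 추천 다). That is what happens when the character class contains `ㄱ-하` as a single range: every syllable that sorts after 하 falls outside the class and gets replaced. The sketch below compares that variant with the intended class using only the standard `re` module; the names `HANGUL_AND_SPACE`, `MISTYPED`, and `keep_hangul` are made up for this illustration.

```python
import re

# Intended class: Hangul jamo (ㄱ-ㅎ, ㅏ-ㅣ), Hangul syllables (가-힣), and the space.
HANGUL_AND_SPACE = re.compile(r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]")

# Mistyped variant: "ㄱ-하" is one range ending at 하, followed by the
# literal characters "-", "ㅣ", "가", "힣". Syllables after 하 are not covered.
MISTYPED = re.compile(r"[^ㄱ-하-ㅣ가힣]")

def keep_hangul(text: str) -> str:
    """Delete every character that is not Hangul or a space."""
    return HANGUL_AND_SPACE.sub("", text)

sample = "초딩영화줄....오버연기"
print(keep_hangul(sample))        # punctuation removed, every syllable kept
print(MISTYPED.sub(" ", sample))  # 화 is replaced by a space as well
```

Because the intended class lists the space explicitly, the replacement can be the empty string without gluing words together.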
In [97]:
train_data.isnull().values.any()
Out[97]:
False
In [98]:
# Treat empty strings as missing values
train_data['document'] = train_data['document'].replace('', np.nan)
print(train_data.isnull().sum())
 
id          0
document    0
label       0
dtype: int64
In [99]:
train_data.loc[train_data.document.isnull()]
Out[99]:
  id document label
In [100]:
train_data = train_data.dropna(how='any')
In [101]:
print(len(train_data))
 
146182
 

Apply the same preprocessing to the test data.

In [102]:
test_data.drop_duplicates(subset=['document'], inplace=True)  # drop rows with duplicate document text
test_data['document'] = test_data['document'].str.replace("[^ㄱ-ㅎㅏ-ㅣ가-힣 ]", "", regex=True)  # keep only Hangul and spaces
test_data['document'] = test_data['document'].replace('', np.nan)  # turn empty strings into NaN
test_data = test_data.dropna(how='any')  # drop rows with missing values
print('Number of test samples after preprocessing:', len(test_data))
 
Number of test samples after preprocessing: 48995
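
Since the train and test sets go through the same steps (drop duplicates, keep only Hangul and spaces, turn empty strings into NaN, drop NaN), the steps could be wrapped in one helper. This is only a sketch under assumptions: `clean_reviews` and `HANGUL_ONLY` are hypothetical names, and the extra `.str.strip()` (so that whitespace-only reviews are dropped too) is an addition that the cells above do not perform.

```python
import numpy as np
import pandas as pd

# Everything except Hangul jamo, Hangul syllables, and the space.
HANGUL_ONLY = r"[^ㄱ-ㅎㅏ-ㅣ가-힣 ]"

def clean_reviews(df: pd.DataFrame, text_col: str = "document") -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is left untouched."""
    out = df.drop_duplicates(subset=[text_col]).copy()
    out[text_col] = out[text_col].str.replace(HANGUL_ONLY, "", regex=True)
    # Assumption: whitespace-only reviews are treated as empty as well.
    out[text_col] = out[text_col].str.strip().replace("", np.nan)
    return out.dropna(subset=[text_col]).reset_index(drop=True)

# Hypothetical mini frame in the same layout as ratings_test.txt.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "document": ["굳 ㅋ", "굳 ㅋ", "GDNTOPCLASSINTHECLUB", None, "뭐야 이 평점들은...."],
    "label": [1, 1, 0, 0, 0],
})
cleaned = clean_reviews(toy)
print(cleaned["document"].tolist())
```

Usage would then be `train_data = clean_reviews(train_data)` and `test_data = clean_reviews(test_data)`, which keeps the two pipelines from drifting apart.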
 

Tokenization

In [103]:
stopwords = ['의','가','이','은','들','는','좀','잘','걍','과','도','를','으로','자','에','와','한','하다']
In [104]:
# The list above defines the stopwords (particles, conjunctions, and the like).
In [105]:
okt = Okt()
In [106]:
okt.morphs('와 이런 것도 영화라고 차라리 뮤직비디오를 만드는 게 나을 뻔', stem = True)
Out[106]:
['오다', '이렇다', '것', '도', '영화', '라고', '차라리', '뮤직비디오', '를', '만들다', '게', '나다', '뻔']
 

Let's use Okt to tokenize the document column of train_data and remove the stopwords.

In [107]:
from tqdm import tqdm

X_train = []
for sentence in tqdm(train_data['document']):
    temp_X = okt.morphs(sentence, stem=True)  # tokenize and stem
    temp_X = [word for word in temp_X if word not in stopwords]  # remove stopwords
    X_train.append(temp_X)
 
100%|██████████████████████████████████████████████████████████████████████| 146182/146182 [05:16<00:00, 462.51it/s]
In [108]:
X_test = []
for sentence in tqdm(test_data['document']):
    temp_X = okt.morphs(sentence, stem=True)  # tokenize and stem
    temp_X = [word for word in temp_X if word not in stopwords]  # remove stopwords
    X_test.append(temp_X)
 
100%|████████████████████████████████████████████████████████████████████████| 48995/48995 [02:26<00:00, 333.85it/s]
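
The two loops above do the same thing for the train and test sets, so they could share one function. The sketch below is an assumption-laden illustration rather than the code used in this post: `tokenize_and_filter` is a hypothetical helper, and it takes the tokenizer as an argument so it can be tried without KoNLPy installed. Converting the stopword list to a set makes each membership check constant time on average.

```python
from typing import Callable, Iterable, List

def tokenize_and_filter(
    sentences: Iterable[str],
    tokenize: Callable[[str], List[str]],
    stopwords: Iterable[str],
) -> List[List[str]]:
    """Tokenize each sentence and drop tokens that are stopwords."""
    stopword_set = set(stopwords)
    return [
        [tok for tok in tokenize(sentence) if tok not in stopword_set]
        for sentence in sentences
    ]

# Stand-in tokenizer for illustration only: whitespace splitting.
# With KoNLPy one would pass something like: lambda s: okt.morphs(s, stem=True)
demo = tokenize_and_filter(["영화 를 보다", "연기 도 좋다"], str.split, ["를", "도"])
print(demo)
```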
In [110]:
print(X_train[:5])
 
[['아', '더빙', '진짜', '짜증나다', '목소리'], ['포스터', '보고', '초딩', '영', '줄', '오버', '연기', '조차', '가볍다', '않다'], ['너', '무재', '밓었', '다그', '래서', '보다', '추천', '다'], ['교도소', '이야기', '구먼', '솔직', '재미', '없다', '평점', '조정'], ['사이', '몬페', '그', '익살스럽다', '연기', '돋보이다', '영', '스파이더맨', '에서', '늙다', '보이다', '던', '커스틴', '던스트', '너무나도', '이쁘다', '보이다']]
In [111]:
print(X_test[:5])
 
[['굳다', 'ㅋ'], ['뭐', '야', '평점', '나쁘다', '않다', '점', '짜다', '리', '더', '더욱', '아니다'], ['지루하다', '않다', '완전', '막장', '임', '돈', '주다', '보기', '에는'], ['만', '아니다', '별', '다섯', '개', '주다', '왜', '로', '나오다', '제', '심기', '불편하다'], ['음악', '주가', '되다', '최고', '음악', '영화']]
In [113]:
from gensim.models import Word2Vec
In [114]:
# gensim 3.x signature; gensim 4.0+ renamed `size` to `vector_size`
model = Word2Vec(X_train, size=300, window=3, min_count=5, workers=1)
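
Of the parameters passed to `Word2Vec`, `min_count=5` decides which tokens get a vector at all: tokens whose total frequency is below the threshold are ignored. The following is a rough pure-Python approximation of that pruning step, not gensim's actual implementation; `frequent_vocab` and the toy `docs` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set

def frequent_vocab(tokenized_docs: Iterable[List[str]], min_count: int) -> Set[str]:
    """Return the tokens that occur at least min_count times across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}

# Hypothetical tokenized reviews.
docs = [["영화", "재미"], ["영화", "연기"], ["영화", "재미", "최고"]]
vocab = frequent_vocab(docs, min_count=2)
print(sorted(vocab))
```

Tokens such as 연기 and 최고 appear only once here, so with `min_count=2` they would not receive an embedding.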
In [115]:
word_vectors = model.wv
In [117]:
# gensim 3.x API; gensim 4.0+ removed `wv.vocab` in favor of `wv.key_to_index`
vocabs = word_vectors.vocab.keys()
word_vectors_list = [word_vectors[v] for v in vocabs]
In [118]:
print(word_vectors_list[:5])
 
[array([ 0.35371915,  0.01369796, -0.32485738, -0.07170954, -0.8098252 ,
        0.2438762 ,  0.08732416,  0.63042426,  0.81642616,  0.28534853,
       -0.3122959 , -0.33240107,  0.19078688, -0.01236966,  0.06174204,
        0.41557205, -0.08673642, -1.458105  , -0.32850906, -1.112841  ,
       -0.5430838 ,  0.40445134,  0.10489815,  0.6468941 ,  0.19937538,
       -0.60479933, -0.22478952,  0.7565353 , -0.14293832,  0.23310506,
        0.15129174,  0.6845647 , -0.7044086 ,  0.7915299 , -0.35835764,
       -1.2526863 ,  0.45404494, -0.04736769,  0.5942899 ,  0.59826547,
        0.36945054,  0.47753644, -0.09401665, -0.36907658,  0.23232228,
       -0.46641174, -0.27822164,  0.2882081 ,  0.04361428, -0.45328766,
        0.22835523, -0.36976653,  0.12345321,  0.14125885,  0.63989264,
       -0.19459705,  0.17302327,  0.35701352, -0.7929591 , -0.77486944,
        0.33167386, -0.63483286, -0.7358724 , -0.5367217 , -0.15168835,
       -0.23927116,  0.12286316, -0.5074119 ,  1.3054442 , -0.298176  ,
        0.53662306, -0.38420925,  0.13963717, -0.4172364 , -0.6023217 ,
       -0.737436  ,  0.10716676,  0.3178931 ,  0.58743054,  0.02297241,
       -0.15831982, -0.18929426,  0.20886979, -0.2715996 ,  0.43760517,
       -0.352647  , -0.06899021,  0.5870781 ,  0.15513158, -0.14012474,
        0.78900623,  0.15468648,  0.30054903,  0.4056097 , -0.33693022,
       -0.19352566, -0.40530896, -0.03303072, -0.36479527,  0.12306235,
        0.09257359, -0.18375042,  0.5485907 , -0.05983978,  0.01656177,
       -0.51319855,  0.27850905, -0.25046661,  0.08234145, -0.20900732,
       -0.5142531 , -0.11942197, -0.69790703,  0.34256333,  0.82137275,
        0.9430014 , -0.04666335,  0.39349815,  0.90695465, -0.40254754,
        0.50508046, -0.4258863 , -0.00949783, -0.61609435,  0.0400856 ,
       -0.29342127,  0.4765701 ,  0.14896019, -1.0968763 ,  0.20674203,
       -0.07628355, -0.22713113,  0.38102186, -0.09913569, -0.2546628 ,
        0.7605381 ,  0.00786115, -0.0531752 ,  0.29045078,  0.09086434,
       -0.09674405,  0.445838  ,  0.98627806, -0.15761015,  0.6493283 ,
       -0.09806692, -0.11581327,  0.31925744, -0.34285492,  0.09583668,
        0.2640743 ,  0.38297677, -1.3245939 ,  0.49583235, -0.3515406 ,
        0.5218009 , -0.6748311 , -0.521731  ,  0.09695819,  0.01711136,
        0.43111882,  0.19854143, -0.24784477,  0.75012076,  0.8029055 ,
        1.0587212 ,  0.2252498 , -0.7543822 ,  0.36283895, -0.6082155 ,
       -0.81244385,  0.12543415, -0.13457002,  0.61658704, -0.859349  ,
        0.8003935 ,  0.00818912, -0.26867083, -0.3244661 ,  0.25344384,
       -0.24900945,  0.18456884, -0.6389164 ,  0.3778808 ,  0.7558439 ,
        0.17785466,  0.37517542,  0.25370252,  0.251337  , -0.47009435,
        0.10807495,  0.6289351 , -0.8183315 , -0.77615744,  0.39609525,
       -0.23315974,  0.04918276,  0.879458  ,  0.08264337, -0.01516624,
       -0.19204898, -0.33442628, -0.24506453,  0.6312478 ,  0.28480405,
        0.1000071 ,  0.03199782,  0.76529694, -0.97053623,  0.26944557,
       -0.4845214 ,  0.24615394,  0.49071085, -1.1139286 , -0.09198507,
        1.1445796 , -0.48809683, -0.21912406, -0.31834188, -0.32249808,
       -0.17586514, -0.02084946,  0.36014786,  0.62940025,  0.19775295,
        0.4275967 ,  1.1811138 , -0.55622935, -0.28744274, -0.5984628 ,
       -0.46481133,  0.04407143, -0.44956017,  0.63453496,  0.09470233,
       -0.27863345, -0.20793761,  0.42128265, -0.8122096 ,  0.42522484,
        0.65584326, -0.13752533, -0.03722789, -0.43598282, -0.72522   ,
       -1.2605996 , -0.7767486 ,  0.46211237,  0.4232772 , -0.13487536,
        0.07996047,  0.5894034 ,  0.7337533 ,  0.5783463 , -0.27723   ,
        0.73198324, -0.5529127 ,  0.45494598,  0.85277545, -0.07826032,
       -0.3091203 ,  0.27356094,  0.44235778, -0.60123974,  0.1393649 ,
        0.29696792, -0.38528672,  0.38809997, -0.32254714, -0.6605602 ,
        0.7221839 , -0.01222237,  0.10330763, -0.16048087, -0.3377224 ,
        0.06903428, -0.20879556, -0.49042755, -0.2978255 ,  0.6634165 ,
       -0.0040673 , -0.30612382,  0.25746432,  0.01667893,  0.10528573,
       -0.9395296 , -0.5134512 , -0.17095241,  0.20566861,  0.19690408,
       -0.69362146,  0.46221745, -0.37353134,  0.17481603,  1.1287862 ,
       -0.94452304, -0.14353661, -0.54536766, -0.10933635, -0.8106252 ],
      dtype=float32), array([ 1.89083979e-01,  3.60822499e-01,  7.52017677e-01, -3.48109305e-01,
       -2.02179998e-01,  1.95758149e-01,  1.07987747e-02,  2.16522679e-01,
        7.80965388e-02,  5.46024323e-01,  4.47630286e-01, -2.50288010e-01,
        3.74688029e-01,  3.50882798e-01,  2.24708363e-01, -7.88937032e-01,
       -2.01267555e-01,  5.12138642e-02,  4.48362157e-02, -4.50382046e-02,
       -3.82816911e-01,  4.62848485e-01, -1.72528237e-01, -2.24339798e-01,
        1.38546079e-01,  4.27904785e-01,  5.54159954e-02, -2.41944958e-02,
       -8.99843201e-02, -1.19710870e-01,  5.64997494e-01, -1.27891228e-01,
       -6.04645252e-01,  4.59669173e-01, -4.07027364e-01,  3.06060016e-01,
        9.86440759e-03,  2.85450518e-01,  6.26932979e-02,  5.17451316e-02,
       -1.07931681e-01, -2.93999702e-01, -3.39162529e-01,  2.97916681e-01,
        1.00361094e-01, -4.27598834e-01, -1.60671473e-02,  2.47456923e-01,
       -4.32545543e-01,  6.28253996e-01, -1.87700942e-01,  7.77709335e-02,
        4.08763677e-01,  4.48305935e-01,  1.81456432e-01, -3.58734667e-01,
       -4.65122201e-02,  6.05657697e-01,  3.06532949e-01, -4.28248346e-01,
       -4.96852875e-01,  2.42814030e-02, -7.16932714e-02, -2.23727793e-01,
        4.29408193e-01,  4.77099359e-01,  5.99484265e-01, -1.14740156e-01,
        6.29065335e-02,  1.35380998e-01, -5.70586562e-01,  1.34728923e-01,
        6.01069510e-01,  4.44200873e-01,  1.23525038e-01, -2.21321955e-01,
       -2.08631530e-02, -1.22782357e-01,  7.99565017e-01,  6.71724677e-01,
        2.06830800e-01,  5.25520325e-01, -4.09425832e-02,  6.59353808e-02,
        1.61390305e-01, -4.82085794e-02, -6.60792232e-01,  4.08435129e-02,
        9.48701739e-01, -6.49247617e-02, -1.02402620e-01,  6.12356663e-01,
        1.68150201e-01, -5.48166558e-02, -1.96587890e-01, -1.38380036e-01,
        4.97825146e-01,  9.06539187e-02,  4.54917133e-01, -1.44691497e-01,
        5.43666899e-01,  5.55345118e-01,  2.48864308e-01,  2.88786888e-01,
        1.22581087e-01,  5.48054218e-01, -1.11334085e-01, -5.59164919e-02,
       -1.10279441e-01,  2.65116952e-02, -1.78087041e-01,  1.03461698e-01,
       -1.68309927e-01, -2.93364227e-01, -2.80314744e-01,  2.45657071e-01,
       -2.52108127e-01,  3.91797513e-01,  4.24634784e-01,  5.21256104e-02,
        4.64151762e-02, -5.96687376e-01, -6.31786361e-02, -6.06046140e-01,
        1.90468788e-01, -1.21336170e-01, -7.54087031e-01,  2.63654273e-02,
        4.64102983e-01,  2.20393002e-01,  8.37704122e-01,  4.70884055e-01,
        4.52386260e-01,  1.01350799e-01, -3.01946342e-01,  4.23438996e-01,
        1.31241590e-01,  1.35735497e-02,  3.93659145e-01,  1.78351849e-01,
        2.05589101e-01,  7.72399083e-02,  2.03281045e-01, -4.43456806e-02,
        8.00276220e-01,  6.42984360e-02, -1.06155254e-01, -4.29555714e-01,
        1.17843553e-01, -3.66424173e-01,  4.12188023e-01,  5.10199904e-01,
        6.58843964e-02,  1.09729791e+00, -3.45090717e-01, -3.15299600e-01,
       -6.50117040e-01, -3.36591393e-01, -2.02511139e-02, -4.17784750e-01,
        4.40176070e-01, -1.68363184e-01,  1.93045944e-01,  6.30960047e-01,
       -6.43173218e-01,  1.08500674e-01,  3.70048136e-01, -1.36020675e-01,
        7.43669510e-01, -4.15490717e-01, -2.86625862e-01, -1.11983165e-01,
       -5.71260989e-01, -3.50878447e-01,  2.43937269e-01, -1.03231566e-02,
        4.78787795e-02,  2.61154830e-01, -2.11821795e-02,  1.71315491e-01,
       -1.47004545e-01, -1.02724321e-01, -3.61775905e-01,  1.58845782e-01,
        7.29539216e-01,  5.56319952e-01,  7.01053888e-02,  1.37133464e-01,
       -9.07938629e-02,  2.31083959e-01,  9.29590017e-02,  2.24343970e-01,
        1.54447705e-01, -3.78274135e-02,  2.42071837e-01,  2.19196320e-01,
       -2.31441006e-01, -3.40591252e-01, -1.16538942e-01,  1.70130357e-01,
        1.34081095e-01,  3.77010345e-01,  1.62052557e-01, -8.95812293e-04,
        2.23923456e-02, -3.78147155e-01, -2.00190708e-01,  1.70119047e-01,
        2.55231738e-01,  1.63391501e-01,  1.94264725e-01, -3.88408959e-01,
        4.85834688e-01, -3.59457850e-01, -3.57690789e-02,  7.16300979e-02,
        6.21361017e-01, -3.64450753e-01,  4.95926887e-02, -1.48590758e-01,
        2.88025022e-01,  8.92031193e-03, -2.64723122e-01,  7.38198757e-02,
       -2.35711951e-02, -2.64820993e-01,  6.85107172e-01, -1.92027554e-01,
        2.45485291e-01,  1.14059545e-01, -2.33155559e-03,  1.41599774e-01,
       -7.72717416e-01,  6.92200601e-01, -1.16550513e-01, -3.99360918e-02,
        3.10919285e-01,  5.91386259e-01,  1.99278906e-01, -3.24863017e-01,
       -3.21406066e-01, -1.63486227e-01, -1.54580042e-01,  4.73143607e-02,
       -3.95469032e-02, -1.56700492e-01, -1.27931848e-01,  4.95056421e-01,
       -2.79918224e-01, -2.99590886e-01, -3.33879501e-01,  1.78528409e-02,
        1.50827885e-01, -1.58116594e-01,  9.14419740e-02, -4.13884759e-01,
        1.48924038e-01, -4.60464060e-01, -7.44639456e-01,  1.86095953e-01,
       -4.21039194e-01,  3.79884243e-01, -1.77336127e-01,  1.07568644e-01,
       -1.49131283e-01, -2.46030480e-01, -2.56137639e-01, -3.05458397e-01,
       -1.83648095e-01,  1.70516133e-01,  4.38269198e-01,  3.54776420e-02,
        7.08496213e-01,  1.81088760e-01, -1.25971735e-01, -7.46804595e-01,
        5.11034206e-02,  8.17973986e-02,  2.89150048e-02, -4.08525718e-03,
       -3.29116434e-01, -2.40118444e-01, -3.17283869e-01,  3.64307053e-02,
        3.74641150e-01,  8.18490386e-02, -4.90651615e-02,  1.63533837e-01,
        6.41686916e-02,  1.45682320e-01, -1.28867716e-01,  3.94148789e-02,
        3.01715016e-01, -2.82113016e-01,  4.47516203e-01, -5.28098106e-01,
       -2.74513572e-01, -7.20592678e-01,  2.84156859e-01, -4.48707342e-01],
      dtype=float32), array([ 2.72741646e-01,  5.59471309e-01,  4.97208387e-01, -2.29130819e-01,
       -3.20808321e-01,  3.68277401e-01,  1.71144996e-02,  7.50725985e-01,
       -3.41687769e-01, -3.71341348e-01,  6.23047292e-01,  2.89956391e-01,
        3.14290732e-01, -1.45605123e-02,  1.40382290e-01, -4.07024741e-01,
        1.33055419e-01, -2.43933380e-01, -1.44910216e-01, -3.37201715e-01,
       -1.44952998e-01, -2.40601644e-01,  2.47322351e-01,  2.88669020e-01,
        7.44223893e-01, -2.67053574e-01,  4.31541651e-01,  1.05795503e-01,
        4.66114730e-01, -2.24967778e-01,  3.63573432e-01,  3.04573447e-01,
       -1.13266230e-01, -1.24657415e-01,  3.12574834e-01, -1.44246832e-01,
        2.94679284e-01,  4.40773666e-01,  2.47264832e-01, -8.54680128e-03,
       -2.55212337e-01,  5.07281184e-01, -2.30407119e-01, -3.90755087e-01,
        3.12610120e-01, -5.29857934e-01, -2.16198370e-01, -4.26115900e-01,
       -7.46489242e-02,  6.35282576e-01, -4.47295427e-01, -9.90432501e-01,
       -1.17460109e-01, -1.31746054e-01,  3.54048997e-01,  3.08320045e-01,
        5.28045237e-01, -2.42500573e-01,  7.59225190e-02, -4.75058347e-01,
       -8.56742859e-02, -2.67149657e-01,  3.19465935e-01,  4.84808445e-01,
        6.09682977e-01, -1.31521791e-01,  4.18893188e-01, -3.31940293e-01,
        5.42169273e-01, -4.15439339e-04,  9.76526067e-02,  5.27143061e-01,
        3.30832928e-01, -9.67895165e-02, -2.52540499e-01, -5.21479473e-02,
       -4.43033785e-01,  1.40004396e-01,  1.55801281e-01,  4.79818374e-01,
        5.14537573e-01,  5.98484576e-01, -1.01161227e-01, -2.01351926e-01,
        5.31843841e-01,  6.31995872e-02, -6.41374826e-01,  2.19621137e-01,
        6.75304413e-01, -2.95791216e-02,  1.46914925e-02,  1.87246397e-01,
       -3.75772983e-01,  1.13071248e-01,  6.92664012e-02, -2.32847676e-01,
        4.59760381e-03, -7.54033178e-02,  4.11273092e-02,  3.56095493e-01,
        1.04498519e-02,  8.47651601e-01, -1.62174404e-01,  2.88561732e-01,
        3.93661588e-01, -3.53360891e-01, -4.33149785e-01, -4.57956076e-01,
        7.46447444e-02,  2.41559461e-01,  4.99245971e-01, -3.65617573e-01,
       -7.53247917e-01,  1.03651963e-01,  4.76928800e-01, -1.90303817e-01,
       -4.00226027e-01,  2.61735916e-01,  8.47440138e-02,  6.15003586e-01,
       -3.85171682e-01, -3.48686352e-02,  4.30179536e-01, -2.04731762e-01,
       -5.32923222e-01,  2.78932720e-01, -1.11470241e-02,  1.32232919e-01,
       -1.00497395e-01, -2.42616042e-01,  1.47589952e-01,  5.21141171e-01,
        3.44758302e-01, -4.93468791e-01,  5.14328837e-01,  6.71053588e-01,
        4.20176759e-02,  3.72409523e-02, -4.44082543e-02,  2.20780447e-01,
        3.26485574e-01, -9.60033312e-02, -1.34973466e-01, -9.50792581e-02,
        6.69785738e-02, -1.33209124e-01,  5.29533148e-01,  4.37356979e-01,
       -1.20613441e-01, -2.65428334e-01,  1.14814997e-01,  3.56741488e-01,
        5.29107332e-01,  9.27158237e-01,  9.13419053e-02, -4.87367436e-02,
        5.40176146e-02, -2.78413862e-01,  5.64327501e-02, -4.51262206e-01,
        3.59087408e-01, -3.14880013e-01,  5.68484008e-01,  3.08661789e-01,
        3.73058587e-01,  5.69775999e-01,  8.26771021e-01, -3.68545681e-01,
        8.00820887e-01, -7.78652608e-01, -7.47129023e-01,  2.99532384e-01,
       -7.39989150e-03, -2.91260064e-01, -3.53357613e-01,  1.74712837e-01,
       -1.99199855e-01,  3.21527213e-01, -3.42435688e-01, -2.02249184e-01,
       -1.62893593e-01, -1.53568938e-01, -7.05490530e-01,  3.30612399e-02,
       -2.96956241e-01,  1.26815394e-01, -3.54179561e-01,  1.36005998e-01,
        3.76607209e-01,  9.34403986e-02, -3.41970474e-01,  1.17634202e-03,
       -6.82142437e-01,  1.89872608e-02,  4.95069861e-01,  1.66302137e-02,
       -2.86396950e-01,  2.55062014e-01,  1.04499921e-01, -2.73571253e-01,
       -4.20333445e-01,  3.85120451e-01, -3.82735014e-01, -4.37125266e-02,
        1.66400507e-01, -1.27763888e-02, -5.37047207e-01,  5.09436011e-01,
       -4.45059240e-01, -5.70679367e-01,  2.37025961e-01, -1.61120757e-01,
        1.22028850e-01, -6.66551232e-01, -1.32800385e-01,  3.86158466e-01,
       -5.56773245e-01,  4.10206437e-01, -6.03408933e-01,  6.14461482e-01,
        1.98407069e-01,  4.43106204e-01,  9.23852026e-02,  7.63567770e-03,
        1.86919197e-02,  9.40627009e-02,  1.95024848e-01, -4.85571474e-02,
        1.21599652e-01,  7.28813633e-02, -1.64345577e-01,  4.13775682e-01,
       -3.93240899e-01, -2.61081755e-02, -6.48665845e-01, -5.45462549e-01,
        1.07441083e-01,  9.41287354e-02, -5.69380214e-03, -2.96557993e-01,
        2.72846501e-02, -1.76361904e-01,  5.09363234e-01, -5.14763296e-01,
       -5.98788261e-01,  8.78156573e-02, -8.99333417e-01,  8.76325309e-01,
        1.78669885e-01, -7.32484758e-01,  1.07217036e-01,  5.14497101e-01,
        2.36716270e-01, -3.72894913e-01, -7.36626565e-01, -1.27081931e-01,
        9.37746167e-01,  2.79267848e-01, -8.34866703e-01, -1.75002977e-01,
        3.23209018e-02, -4.19930190e-01, -7.13510811e-01, -1.06480353e-01,
       -1.85092404e-01,  2.05796421e-01,  1.72480941e-01,  2.93205917e-01,
       -1.27772335e-02, -4.98986512e-01, -3.25183421e-01, -3.91801000e-01,
        8.01507175e-01,  9.24432352e-02,  2.74052113e-01, -1.35584801e-01,
        5.85335791e-01,  1.28629040e-02, -2.08709881e-01,  1.05944924e-01,
       -4.93066572e-02, -2.95732051e-01, -7.66936541e-01, -3.12718749e-01,
        1.87949345e-01,  3.36926393e-02, -1.98614448e-02,  4.49714780e-01,
        1.03951491e-01,  3.57141167e-01, -2.49375671e-01,  1.58711210e-01,
       -1.35224596e-01,  1.69218015e-02,  8.63822520e-01, -1.64794073e-01,
       -2.71235675e-01,  9.45416540e-02,  2.17342198e-01, -2.20280439e-01],
      dtype=float32), array([ 6.03196584e-02,  5.89848980e-02,  6.71462059e-01, -1.36922643e-01,
       -4.97979522e-02,  1.68914258e-01, -6.47643134e-02,  6.28605366e-01,
       -1.07665658e-01,  3.00725698e-01,  5.39558351e-01,  1.55261889e-01,
       -2.21495688e-01,  1.44634089e-02,  3.34076554e-01, -7.40331352e-01,
       -1.40353844e-01, -7.08135605e-01,  2.59288996e-01, -1.55486345e-01,
       -4.10178721e-01,  7.84334391e-02, -1.07957721e-01, -9.91119351e-03,
        6.33539120e-03,  2.83227146e-01, -4.45772111e-01,  1.84048176e-01,
       -2.89564073e-01,  1.47172019e-01, -9.91203859e-02,  5.62133908e-01,
       -2.67597586e-01, -8.39236975e-02, -5.01434147e-01, -7.80926049e-02,
       -3.84808667e-02, -2.31444359e-01,  5.35222411e-01,  7.52962172e-01,
        4.46749598e-01, -3.62788320e-01, -7.15136468e-01, -1.13915861e-01,
        1.11153081e-01, -6.71019256e-01, -3.26816291e-01,  5.07549234e-02,
       -1.58542514e-01,  1.73447415e-01, -4.57953662e-01, -6.78138614e-01,
        3.83386284e-01,  2.28332803e-01, -1.77043766e-01, -3.45731944e-01,
        4.26626019e-02,  5.89942157e-01, -7.27468431e-02, -6.11540794e-01,
        2.35552102e-01,  1.82787493e-01,  9.87255760e-03,  1.17773451e-01,
       -2.94847399e-01,  7.67524987e-02,  3.86099637e-01, -3.34918708e-01,
        3.60376358e-01, -1.88560918e-01,  3.07276864e-02, -1.45878986e-01,
        6.66436911e-01,  1.87658876e-01, -2.27986708e-01, -7.07703173e-01,
        7.82628357e-02,  2.45766684e-01,  4.61410224e-01,  3.80710453e-01,
        3.94200832e-01, -3.14014703e-01, -2.33524203e-01,  3.52099314e-02,
        5.89791574e-02,  9.48056206e-02, -3.71192992e-01,  2.54832387e-01,
        8.54609013e-02, -5.59338629e-01,  2.08680958e-01,  8.55490267e-02,
       -5.90835363e-02, -7.89842680e-02, -4.91275638e-03, -4.13997084e-01,
        9.55096126e-01,  5.32095790e-01,  7.32371211e-01,  7.64514878e-02,
        2.86072701e-01,  4.07001197e-01,  1.59877762e-01, -2.72961974e-01,
       -2.90269136e-01,  2.42209304e-02, -2.54579395e-01,  5.29886782e-01,
       -2.08937883e-01,  7.06858700e-03, -5.49211860e-01,  7.72508159e-02,
       -3.01477134e-01, -6.59624413e-02,  5.18411584e-02, -3.79458070e-01,
       -1.97927132e-01,  7.94770598e-01,  2.89951041e-02, -8.74903679e-01,
       -4.35683638e-01, -4.79629874e-01, -8.67981136e-01, -5.86183429e-01,
        1.58078551e-01, -1.49595113e-02,  2.87061453e-01, -5.36117554e-01,
        2.37673476e-01,  3.59247923e-01,  5.03348172e-01, -4.27619010e-01,
        2.59680331e-01, -9.59085763e-01, -1.64522991e-01,  3.13918084e-01,
       -3.17220874e-02,  7.88855255e-01, -1.64739396e-02,  7.33363926e-01,
       -5.51535547e-01, -4.22567368e-01, -4.17627782e-01,  1.39985085e-01,
        2.41581097e-01, -4.82849598e-01,  1.10983081e-01, -5.94322443e-01,
        1.17465243e-01,  9.37577859e-02,  5.43506369e-02,  2.64011830e-01,
       -4.48220462e-01,  1.04658532e+00,  1.54466271e-01, -7.32473850e-01,
       -3.63512963e-01, -2.02750504e-01,  6.77849865e-03,  4.55302179e-01,
        2.91411489e-01, -9.53928381e-02,  4.54726756e-01,  4.35817391e-01,
        5.82312405e-01,  5.52858829e-01,  3.12287003e-01, -4.32709903e-01,
        4.78368104e-01, -4.96160060e-01, -1.30568898e+00, -3.05571053e-02,
       -4.17205036e-01,  5.40552707e-03, -2.35717837e-02,  2.80974895e-01,
       -1.80961847e-01, -6.76413924e-02, -1.38566226e-01,  3.43638688e-01,
       -2.25209426e-02, -5.20227365e-02, -4.75014269e-01,  5.70716672e-02,
       -4.21347357e-02,  2.05003526e-02, -8.44344422e-02,  6.89909101e-01,
        4.62041378e-01, -7.46747553e-02, -2.20212322e-02,  6.96773946e-01,
       -4.08322096e-01, -1.52332842e-01,  1.69781849e-01, -4.17430932e-03,
        4.04096901e-01, -6.67110607e-02, -4.17551398e-02,  3.54242474e-01,
       -2.70383656e-02, -3.98979247e-01,  2.41106436e-01, -4.40965146e-01,
       -4.11580265e-01,  3.68709624e-01, -2.40588766e-02,  7.99415782e-02,
       -1.51831910e-01,  6.93504438e-02,  5.70186973e-01, -1.53628454e-01,
       -1.78356752e-01,  2.55550388e-02, -1.63382441e-01,  3.99868220e-01,
        3.88028026e-01,  3.17336947e-01, -2.72428453e-01,  5.01622200e-01,
        3.05703193e-01,  2.60679610e-02, -2.05755234e-01, -1.86956525e-01,
       -1.39118545e-03,  5.57816803e-01,  3.22988361e-01, -6.32455349e-01,
        2.00472444e-01,  1.31961077e-01,  3.42345119e-01,  1.29957110e-01,
       -8.19237173e-01,  2.20708083e-02, -5.33763729e-02, -7.38726795e-01,
        2.60268092e-01,  9.17707443e-01,  2.97894120e-01, -2.13945538e-01,
       -1.67526901e-01, -2.83442169e-01,  2.08203942e-01,  7.81320557e-02,
        4.62610759e-02,  1.19252130e-01, -7.17722416e-01,  1.37458667e-02,
       -1.51493654e-01, -3.49560201e-01,  1.11902565e-01, -1.34107888e-01,
        4.36691165e-01, -1.53211027e-01, -1.05859257e-01, -2.86821634e-01,
        4.92000312e-01,  5.81809342e-01, -2.72545904e-01,  2.15444088e-01,
        3.30700517e-01,  1.88090548e-01,  1.35857403e-01, -2.31688157e-01,
       -6.08850159e-02,  4.09037024e-01,  2.71765500e-01,  1.78906828e-01,
       -6.53467715e-01, -3.19693863e-01, -3.57187957e-01, -5.24738468e-02,
        5.74530005e-01, -1.49983123e-01, -1.43374801e-01, -3.55764329e-01,
        3.01326811e-01, -5.23645505e-02, -3.05772185e-01, -5.72189949e-02,
       -3.91847342e-01, -1.46970525e-01, -7.09563494e-01, -2.65069515e-01,
       -1.46047592e-01,  2.35722899e-01, -5.64781308e-01, -2.08958209e-01,
       -1.74739301e-01,  1.20109782e-01,  1.99529067e-01, -3.24008584e-01,
       -3.25309373e-02, -4.36910659e-01,  8.06470156e-01, -2.46637955e-01,
       -1.33475900e-01,  5.04725671e-04,  7.77088344e-01, -4.22014803e-01],
      dtype=float32), array([ 2.17656240e-01, -4.56977934e-02,  1.79055318e-01, -1.82327196e-01,
       -3.85262728e-01, -7.41048232e-02, -1.53223500e-01,  2.62788422e-02,
        2.74636567e-01,  2.69390911e-01,  1.26117319e-01,  2.16805577e-01,
        2.52055407e-01,  7.41537735e-02,  4.61898714e-01, -4.40649122e-01,
       -7.69050866e-02,  5.15398532e-02, -6.11095354e-02,  2.11160388e-02,
       -2.90314555e-01, -3.71565640e-01, -2.48965487e-01, -2.72989661e-01,
       -2.78231025e-01,  1.23519577e-01, -3.64305936e-02,  2.16850266e-03,
       -2.25915670e-01,  2.10577309e-01, -1.31380290e-01, -2.52300650e-01,
        1.50190249e-01,  1.63318530e-01, -3.38114232e-01, -3.34840089e-01,
        1.95475504e-01,  2.93124318e-01,  2.58704871e-02, -1.62718043e-01,
        3.69364530e-01, -2.91850150e-01, -6.33109659e-02, -9.59895924e-02,
        1.91528141e-01, -3.33847493e-01,  4.93291467e-01, -4.16541100e-02,
       -3.74570638e-01,  5.32945991e-01, -5.40652454e-01, -7.40479380e-02,
        5.84290922e-01, -2.73714453e-01,  1.42547384e-01, -4.18026187e-03,
        4.91150143e-03, -5.78749120e-01,  3.71671617e-01, -5.31067215e-02,
       -2.12432384e-01,  5.80287993e-01, -2.49311954e-01,  6.09517545e-02,
        2.40680799e-01,  8.13592807e-04,  6.12078011e-01, -1.52163431e-01,
        1.72377184e-01,  3.64941835e-01, -2.98058778e-01,  3.08768097e-02,
        7.92064309e-01, -1.97341107e-02, -1.26708392e-02, -4.03093368e-01,
        5.59222847e-02,  3.37912291e-01,  3.54329526e-01,  3.95138383e-01,
       -2.87782568e-02, -6.38584569e-02, -4.10286933e-01, -9.64911282e-02,
        1.91933569e-03,  1.29248817e-02, -3.91132861e-01,  3.24332148e-01,
        6.08230114e-01,  6.51623383e-02,  2.64245033e-01,  5.10083973e-01,
       -6.36350960e-02,  1.24863766e-01,  1.47031412e-01, -2.75891095e-01,
        7.07865581e-02,  4.45132226e-01,  1.64063707e-01,  4.23206419e-01,
        7.64952898e-01,  9.40596521e-01,  2.25024194e-01,  5.26319779e-02,
        1.44661397e-01,  3.72110724e-01, -1.47175357e-01,  1.94228049e-02,
       -5.28643191e-01, -3.12528372e-01,  4.28589731e-02, -7.58089125e-02,
       -3.69820774e-01, -1.90999016e-01, -2.87862509e-01, -4.44615424e-01,
       -3.70163262e-01, -2.01504812e-01,  1.75851926e-01, -6.53674066e-01,
       -1.08273111e-01, -5.29378355e-01, -6.92961812e-02, -4.37651932e-01,
        2.74295181e-01, -3.21805179e-01, -1.49035186e-01,  9.59187075e-02,
        4.55997288e-01, -2.02970892e-01,  3.64691257e-01, -3.71115625e-01,
        2.36435473e-01, -6.77121952e-02, -7.91595653e-02,  3.81101593e-02,
        4.58155513e-01,  7.22746253e-01,  2.61120051e-01,  6.56355262e-01,
        2.30613187e-01, -2.21043423e-01, -6.80030584e-02,  5.21974638e-04,
        9.31287557e-02, -1.14787936e-01, -8.07324797e-02, -2.28290230e-01,
        6.97241053e-02,  1.37614191e-01,  1.80701941e-01,  2.05788925e-01,
       -1.26375630e-01,  7.43233562e-01, -3.31103027e-01, -2.63826042e-01,
       -7.29277134e-01, -1.90716714e-01,  2.94557482e-01, -7.53001869e-02,
        2.10046202e-01, -1.88887477e-01, -1.79407835e-01, -2.01150030e-01,
       -8.39992762e-02, -1.11905202e-01,  9.39044058e-02, -8.15895870e-02,
        1.26185894e-01, -2.42266759e-01, -5.53619742e-01, -4.15588737e-01,
       -5.27898550e-01,  1.42398495e-02, -4.66146767e-02, -5.08190989e-01,
        7.27347910e-01,  1.99283645e-01,  2.58794352e-02,  3.49124968e-02,
       -7.47016147e-02, -4.47184360e-03, -6.93787634e-01,  3.32751274e-01,
        6.46449983e-01,  1.49342716e-01, -1.96989864e-01, -3.00018728e-01,
        6.93749562e-02,  1.50969028e-01, -3.70788157e-01, -1.26695916e-01,
       -4.35676575e-01,  2.63285458e-01,  3.72674584e-01,  8.96443501e-02,
       -5.75118884e-02,  1.77178726e-01,  1.52940601e-01,  2.45767668e-01,
       -3.16196471e-01,  2.78230160e-01, -2.41287217e-01, -4.85016942e-01,
        4.82007235e-01,  1.50377288e-01,  3.51905636e-02,  1.41834259e-01,
       -2.82464594e-01,  6.74631894e-02,  1.56809658e-01, -2.21668512e-01,
        3.69675547e-01, -1.25314519e-01, -2.46489659e-01,  2.35017046e-01,
       -4.15154904e-01,  1.12579159e-01, -1.99675299e-02, -1.99912861e-01,
        1.43613011e-01,  3.04880828e-01, -1.79036528e-01, -8.24095830e-02,
       -3.86389285e-01,  2.38964140e-01,  4.97853369e-01, -3.15055162e-01,
        3.56375426e-02,  2.22352415e-01, -2.95332521e-01, -8.07289928e-02,
       -2.44374037e-01,  4.55071241e-01, -1.18847750e-01, -3.07038128e-01,
       -3.16033214e-01,  7.52053976e-01, -9.13181454e-02,  5.90750463e-02,
       -2.93274343e-01,  1.63834214e-01,  7.93743134e-02, -1.24472566e-01,
       -3.57467651e-01,  2.37572268e-01,  1.69744685e-01,  7.17308000e-02,
        7.89930969e-02, -1.78745389e-01,  2.55079061e-01,  2.30893772e-02,
       -1.35140093e-02, -5.18991172e-01, -4.20268476e-01, -1.93794537e-02,
        2.30221346e-01, -4.30002362e-02, -5.40678382e-01, -1.41068086e-01,
       -3.21468376e-02, -3.58353220e-02, -1.71644360e-01, -1.33488588e-02,
       -4.30137962e-01,  2.95036316e-01,  4.26383410e-03,  1.09888256e-01,
       -3.85551423e-01,  4.25162226e-01, -1.99403971e-01,  1.60104781e-01,
        2.85311252e-01,  1.04979314e-01, -1.49682418e-01, -2.72936910e-01,
        5.92215002e-01, -2.80328155e-01, -4.05921936e-01, -9.34486017e-02,
       -5.08446321e-02,  1.61741581e-02,  1.24515124e-01, -4.22494888e-01,
        6.24823906e-02, -2.43669033e-01,  1.12264566e-01,  6.12834752e-01,
       -2.10650682e-01, -2.16908336e-01, -3.23359042e-01, -3.01681042e-01,
        8.65234658e-02, -4.51513857e-01, -1.00729354e-01,  5.47674187e-02,
       -8.28587115e-02, -6.48742676e-01,  4.97995645e-01, -3.09794247e-01],
      dtype=float32)]
 

Visualizing the embedding results

 
  1. PCA
In [119]:
from sklearn.decomposition import PCA
In [120]:
pca = PCA(n_components=2)  # reduce the 300-dimensional vectors to 2 dimensions
In [121]:
xys = pca.fit_transform(word_vectors_list)
xs = xys[:,0]
ys = xys[:,1]
In [122]:
print(xys[:5])
 
[[-2.65700024  2.08231725]
 [-1.4255432   0.63475804]
 [-1.54043232  1.2095613 ]
 [-0.9708202   1.72173189]
 [-0.84313219 -1.06927979]]
In [123]:
print(xs[:5])
 
[-2.65700024 -1.4255432  -1.54043232 -0.9708202  -0.84313219]
In [124]:
print(ys[:5])
 
[ 2.08231725  0.63475804  1.2095613   1.72173189 -1.06927979]
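Before plotting, it can help to check how much of the original variance the two principal components keep. The following is a minimal sketch, not the notebook's own code: `demo_vectors` is a random stand-in for `word_vectors_list` (an assumption for illustration), and `explained_variance_ratio_` is the attribute scikit-learn's `PCA` exposes after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for word_vectors_list: 200 random 300-dimensional vectors (assumption).
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(200, 300)).astype(np.float32)

demo_pca = PCA(n_components=2)
demo_xys = demo_pca.fit_transform(demo_vectors)

# Fraction of the total variance captured by each of the two components.
ratios = demo_pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the sum is small, most of the structure in the 300-dimensional space is lost in the 2-D projection, which would explain a plot in which words do not separate into visible groups.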
In [127]:
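def plot_2d_graph(vocabs, xs, ys):
    """Scatter-plot the 2-D points and label each one with its word."""
    plt.figure(figsize=(25, 15))
    plt.scatter(xs, ys, marker='o')
    # Annotate every point with the vocabulary entry at the same index
    for i, v in enumerate(vocabs):
        plt.annotate(v, xy=(xs[i], ys[i]))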
In [129]:
plot_2d_graph(vocabs, xs, ys)
 
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 30002 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 30002 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20013 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20013 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20154 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20154 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 25925 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 25925 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20035 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20035 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 22899 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 22899 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 28961 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 28961 missing from current font.
  font.load_char(ord(s), flags=flags)
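The glyph numbers in the warnings above are Unicode code points. 30002, 20013, 20154 and the others all fall in the CJK ideograph range (U+4E00 to U+9FFF), so they most likely come from Hanja tokens in the vocabulary for which the current font has no glyphs. A minimal sketch, assuming it is acceptable to leave such tokens out of the plot, finds the indices of the words that can be labeled. The helper name `plottable_indices` and the sample list are hypothetical.

```python
import re

# CJK Unified Ideographs block, which covers the code points in the warnings
CJK_IDEOGRAPH = re.compile(r'[\u4e00-\u9fff]')

def plottable_indices(vocabs):
    """Return indices of tokens that contain no CJK ideographs."""
    return [i for i, v in enumerate(vocabs) if not CJK_IDEOGRAPH.search(v)]

# Code points reported in the warnings, decoded to characters
warned = [chr(c) for c in (30002, 20013, 20154, 25925, 20035, 22899, 28961)]

# Hypothetical sample mixing Hangul tokens with two of the warned characters
sample_vocabs = ['영화', warned[1], '재미', warned[6]]
keep = plottable_indices(sample_vocabs)
print(keep)
```

The resulting indices could then be used to subset `vocabs`, `xs`, and `ys` before calling `plot_2d_graph`. Switching to a font that covers Hanja would be an alternative that keeps every label.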
 
 

PCA-based dimensionality reduction makes the clusters hard to tell apart, so let's try t-SNE.

In [130]:
from sklearn.manifold import TSNE
In [131]:
# Project the word vectors to 2-D with t-SNE
model2 = TSNE(learning_rate=100)
transformed = model2.fit_transform(word_vectors_list)

xs2 = transformed[:, 0]
ys2 = transformed[:, 1]

plt.figure(figsize=(28, 21))

plt.scatter(xs2, ys2)

# Label each point with its word
for i, v in enumerate(vocabs):
    plt.annotate(v, xy=(xs2[i], ys2[i]))

plt.show()
 
C:\Users\saehee jeon\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 30002 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 30002 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20013 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20013 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20154 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20154 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 25925 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 25925 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 20035 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 20035 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 22899 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 22899 missing from current font.
  font.load_char(ord(s), flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:211: RuntimeWarning: Glyph 28961 missing from current font.
  font.set_text(s, 0.0, flags=flags)
C:\Users\sh\Anaconda3\lib\site-packages\matplotlib\backends\backend_agg.py:176: RuntimeWarning: Glyph 28961 missing from current font.
  font.load_char(ord(s), flags=flags)
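t-SNE is stochastic, so the layout can change from run to run, and it is sensitive to `perplexity`. The following is a minimal sketch of a more reproducible call, not the notebook's own configuration: `demo_vectors` is a random stand-in for `word_vectors_list`, and the values chosen for `perplexity`, `init`, and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for word_vectors_list: 60 random 50-dimensional vectors (assumption).
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(60, 50)).astype(np.float32)

demo_tsne = TSNE(
    n_components=2,
    perplexity=10,       # must be smaller than the number of samples
    learning_rate=100,   # same value as in the cell above
    init='pca',          # start from a PCA layout instead of a random one
    random_state=0,      # fix the seed so repeated runs give the same layout
)
demo_xy = demo_tsne.fit_transform(demo_vectors)
print(demo_xy.shape)
```

Fixing `random_state` makes it easier to compare plots when trying several `perplexity` values on the same word vectors.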
 
In [62]:
print(len(X_test))
print(len(y_test))
 
48650
48306