Paper Review 4. Stock Price Prediction (2)

p-jiho 2023. 4. 21. 16:08

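Continuing from the previous post, this post predicts time series data.

The paper reviewed this time required collecting new data, so there is a lot to cover and I wrote it up as a separate post.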

4. Stock price prediction based on LSTM neural network: the effectiveness of news sentiment analysis, 2020

First, create a function in a .py file that collects news from the NYT API.

    # Excerpt from the body of the collection function (needs requests, json, and time imported).
    try:
        key = "YOUR_NYT_API_KEY"  # the key issued for API use after signing up
        year = year_month[0]
        month = year_month[1]
        nyt_base_url = "https://api.nytimes.com/svc/archive/v1/" + str(year) + "/" + str(month) + ".json?api-key=" + key
        nyt_base_url_result = requests.get(nyt_base_url)
        nyt_base_url_text = nyt_base_url_result.text
        nyt_news = json.loads(nyt_base_url_text)
    except Exception:
        return None
    finally:
        time.sleep(10)

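First, follow the steps at https://jiho-0728.tistory.com/4 to obtain a key for the NYT API.

Next, read the year and month from year_month, a list of the form [year, month].

Then build the base URL for the news request. requests fetches the response from that URL, and the JSON-formatted text is loaded with json.loads.

time.sleep is there because requesting large amounts of data back to back sometimes caused errors.

The delay keeps the requests from overloading the API.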
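As a rough sketch of the same idea (build the monthly URL, fetch, always pause), the snippet below separates URL construction from the throttled loop. The names build_archive_url and fetch_months, and the injectable fetch argument, are my own assumptions for illustration and are not part of the original py file. The demo uses a stand-in fetcher so it runs without a network connection or a real key.

```python
import time

ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"


def build_archive_url(year, month, key):
    """Build the monthly archive URL, the same shape as nyt_base_url above."""
    return ARCHIVE_URL.format(year=year, month=month, key=key)


def fetch_months(year_months, fetch, key, delay=10):
    """Fetch each [year, month] pair in order, pausing `delay` seconds after every request.

    `fetch` is any callable that takes a URL and returns parsed JSON,
    for example: lambda url: requests.get(url).json()
    A month whose request fails is stored as None, mirroring `return None` above.
    """
    results = []
    for year, month in year_months:
        try:
            results.append(fetch(build_archive_url(year, month, key)))
        except Exception:
            results.append(None)
        finally:
            time.sleep(delay)  # runs whether the request succeeded or failed
    return results


# Offline demonstration with a stand-in fetcher (hypothetical, no real request is made).
def fake_fetch(url):
    return {"url": url}


demo = fetch_months([["2012", "1"], ["2012", "2"]], fake_fetch, key="DEMO_KEY", delay=0)
print(demo[0]["url"])
```

Passing the fetcher in as an argument is only a convenience for trying the loop offline; with requests installed, the lambda shown in the docstring could be passed instead.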

nyt_title = list(map(lambda x: x["headline"]["main"], nyt_news["response"]["docs"]))
nyt_date = list(map(lambda x: x["pub_date"], nyt_news["response"]["docs"]))

Then map and lambda pull out the headline and the date.

docs holds many articles.

map is used to pull the headline out of each article one by one.
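To make the structure of docs concrete, here is a small sketch using a hand-made dictionary in place of the real response. The two sample articles are invented for illustration, and a real response carries many more fields. The list comprehensions are equivalent to the map and lambda calls above.

```python
# Hand-made stand-in for the parsed JSON (invented sample data, not real NYT output).
nyt_news = {
    "response": {
        "docs": [
            {"headline": {"main": "Markets Rally"}, "pub_date": "2015-03-02T10:00:00+0000"},
            {"headline": {"main": "Tech Stocks Slip"}, "pub_date": "2015-03-03T08:30:00+0000"},
        ]
    }
}

docs = nyt_news["response"]["docs"]

# One headline and one publication date per article, in the same order.
nyt_title = [doc["headline"]["main"] for doc in docs]
nyt_date = [doc["pub_date"] for doc in docs]

print(nyt_title)
print(nyt_date)
```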

nyt_headline_date = []
for i in range(len(nyt_title)):
    nyt_headline_date.append([nyt_title[i], nyt_date[i]])

nyt_headline_date = list(map(lambda x: "" if x[1]=="" else ("" if str(x[1])<"2012-01-01" or str(x[1])>="2022-05-01" else x), nyt_headline_date))
nyt_headline_date = [v for v in nyt_headline_date if v]

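nyt_headline_date combines each headline and its date into one list. append adds the elements one at a time.

Only data dated from January 1, 2012 up to (but not including) May 1, 2022 is kept. Entries outside that period are replaced with an empty string.

A list comprehension then drops the empty entries, leaving only the selected data.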
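The two-step filter above (blank out, then drop) can also be written as a single pass. This is a minimal sketch with a hypothetical helper named in_period and invented sample rows. It relies on the same fact as the original code: dates that start with YYYY-MM-DD compare correctly as plain strings.

```python
def in_period(pub_date, start="2012-01-01", end="2022-05-01"):
    """True when pub_date is non-empty and start <= pub_date < end (string comparison)."""
    return pub_date != "" and start <= str(pub_date) < end


# Invented [headline, date] pairs covering the edge cases.
pairs = [
    ["Old story", "2011-12-31T23:59:59+0000"],
    ["Kept story", "2012-01-01T00:00:00+0000"],
    ["No date", ""],
    ["Too late", "2022-05-01T00:00:00+0000"],
]

kept = [pair for pair in pairs if in_period(pair[1])]
print(kept)
```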

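After that, stopword removal, punctuation removal, lemmatization, lowercasing, and tokenization are applied.

All of this is wrapped in one function that takes year_month (a year and month pair) as input and runs everything from NYT API collection through tokenization.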
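The preprocessing code itself is not shown in this post, so the following is only a rough standard-library sketch of the steps, not the actual implementation. The stopword set is a tiny made-up sample, preprocess_headline is a hypothetical name, and lemmatization is left out because it needs a lexical resource such as the one NLTK provides.

```python
import string

# Tiny illustrative stopword set (assumption); a full list would come from a library such as NLTK.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_headline(headline):
    """Lowercase, strip punctuation, split on whitespace, and drop stopwords."""
    lowered = headline.lower()
    no_punct = lowered.translate(PUNCT_TABLE)
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = preprocess_headline("The Fed Raises Rates, and Markets React!")
print(tokens)
```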

Second, create a file that generates the sentiment analysis scores.

import nyt_collection_func as nyt_func
import datetime
import itertools
import pandas as pd
import numpy as np

from multiprocessing import Pool
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

Load the packages. nyt_collection_func is the .py file created above.

year = list(map(str, list(range(2012,2023))))
month = list(map(str, list(range(1,13))))

year_month = []
for i in range(len(year)):
    for j in range(len(month)):
        year_month.append([year[i], month[j]])
year_month = year_month[0:124]

Create a list of the years and months to collect. Data is collected from January 2012 through April 2022.
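For reference, the same list can be built with itertools.product, which yields the pairs in the same order as the nested loops. The slice of 124 covers 10 full years (120 months) plus the first 4 months of 2022. The variable names years, months, and all_pairs are mine.

```python
import itertools

years = [str(y) for y in range(2012, 2023)]
months = [str(m) for m in range(1, 13)]

# product iterates the last argument fastest, so months cycle within each year.
all_pairs = [[y, m] for y, m in itertools.product(years, months)]

# 10 years * 12 months + 4 months = 124 entries, ending at April 2022.
year_month = all_pairs[:124]

print(year_month[0], year_month[-1])
```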

start = datetime.datetime.now()
nyt_collection_result = list(map(lambda x:nyt_func.nyt_preprocess(x), year_month))
print(datetime.datetime.now()-start)

Run the NYT API collection function created earlier.

It took about 30 minutes to run. I tried parallel processing, but too many requests arrived at once and caused errors. So I dropped parallel processing and added time.sleep as a delay to avoid the errors.

nyt_headline_date = list(map(lambda x: x[0], nyt_collection_result))
nyt_headline_token = list(map(lambda x: x[1], nyt_collection_result))

nyt_headline_date = list(itertools.chain.from_iterable(nyt_headline_date))
nyt_headline_token = list(itertools.chain.from_iterable(nyt_headline_token))

The function returns the dates and tokens together, so map and lambda separate the two.

The lists are then flattened by one level so that each element is a single article rather than a month of articles.
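A small sketch of the split-and-flatten step on invented data. One detail worth noting: the collection code returns None when a request fails, and indexing None with x[0] would raise a TypeError, so this sketch filters such months out first. That filtering is my own addition and does not appear in the original code.

```python
import itertools

# Stand-in for nyt_collection_result: one (dates, tokens) pair per month, None for a failed month.
monthly = [
    ([["h1", "2012-01-03"], ["h2", "2012-01-09"]], [["fed", "rates"], ["oil"]]),
    None,
    ([["h3", "2012-03-01"]], [["jobs", "report"]]),
]

# Skip months whose request failed (assumption: dropping them is acceptable).
collected = [m for m in monthly if m is not None]

per_month_dates = [m[0] for m in collected]
per_month_tokens = [m[1] for m in collected]

# Remove one level of nesting: month -> articles becomes a flat list of articles.
dates = list(itertools.chain.from_iterable(per_month_dates))
tokens = list(itertools.chain.from_iterable(per_month_tokens))

print(len(dates), tokens)
```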

frequent_headline = pd.Series(list(itertools.chain(*nyt_headline_token))).value_counts()
headline_less_frequent_word = frequent_headline[frequent_headline<=10]

def del_less_frequent_headline(line):  ## one line of the list, e.g. ["a","b","c"]
    data = nyt_func.difference(line, list(set(line)& set(headline_less_frequent_word.index)))   ## as with the stopword list, intersect each line with the low-frequency words and remove the matches
    return data

def del_less_headline_fun(headline):
    headline = list(map(del_less_frequent_headline, headline))
    return headline

def main():
    num_cores = 28
    lst = nyt_headline_token
    lst_split = np.array_split(lst,num_cores)
    pool = Pool(num_cores)
    lst = pool.map(del_less_headline_fun, lst_split)
    pool.close()
    pool.join()

    return lst

if __name__ == "__main__":
    start = datetime.datetime.now()
    nyt_headline_token = main()
    print(datetime.datetime.now()-start)

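Next, the tokens of all articles are unpacked into one sequence and counted, and only tokens with a frequency of 11 or more are kept.

The low-frequency tokens are removed with parallel processing.

del_less_frequent_headline finds the intersection between one list element and the low-frequency tokens, then removes those tokens from the element.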
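The same filtering can be sketched sequentially with collections.Counter, which makes the logic easier to follow on a toy example. The data is invented and the threshold is lowered to 2 so the effect is visible (the post keeps tokens with a count of 11 or more). Since nyt_func.difference is not shown, I cannot say whether it preserves token order; this sketch does.

```python
from collections import Counter

# Invented tokenized headlines.
headlines = [
    ["fed", "rates", "rise"],
    ["fed", "oil"],
    ["rates", "fed"],
]

min_count = 2  # illustrative threshold; the post keeps tokens that occur 11 times or more

# Count every token across all headlines.
counts = Counter(tok for line in headlines for tok in line)

# Tokens below the threshold are treated like extra stopwords.
rare = {tok for tok, c in counts.items() if c < min_count}

filtered = [[tok for tok in line if tok not in rare] for line in headlines]
print(filtered)
```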

for chunk in nyt_headline_token[1:]:
    nyt_headline_token[0].extend(chunk)
nyt_headline_token = nyt_headline_token[0]

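Because the data was split for parallel processing, it has an extra level of nesting. To restore the original shape, the remaining chunks are appended to the first element of nyt_headline_token, and then only that first element is kept.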
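A sketch of the split and merge steps in plain Python, under the assumption that only contiguous, near-equal chunks are needed. The helper split_chunks is hypothetical. It may be useful because np.array_split converts its input to an array, which can be sensitive to token lists of unequal length depending on the NumPy version.

```python
import itertools


def split_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Invented tokenized headlines of unequal length.
headlines = [["a"], ["b", "c"], ["d"], ["e"], ["f", "g"]]

chunks = split_chunks(headlines, 3)

# After processing each chunk, chaining them restores a flat list of headlines.
merged = list(itertools.chain.from_iterable(chunks))

print([len(c) for c in chunks], merged == headlines)
```

The chained result has the same order as the input, so the dates collected earlier still line up with the tokens by position.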

nyt_headline_date = list(map(nyt_func.date_cut_day, nyt_headline_date))
nyt_headline_date_token = [[nyt_headline_date[i][1]] + nyt_headline_token[i] for i in range(len(nyt_headline_date))]
nyt_headline_date_token = sorted(nyt_headline_date_token, key=lambda date_plus_tit_txt: date_plus_tit_txt[0])
nyt_headline_token = list(map(nyt_func.from_combination_to_token, nyt_headline_date_token))

nyt_headline_sentence = list(map(nyt_func.word_to_sentence, nyt_headline_token))

nyt_date = [[nyt_headline_date_token[i][0]] for i in range(len(nyt_headline_date_token))]  ## take the dates from the sorted list so they stay aligned with the sorted sentences
nyt_headline_sentence = [nyt_date[i] + [nyt_headline_sentence[i]] for i in range(len(nyt_headline_sentence))]

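First, convert the date to YYYY-MM-DD format.

Then combine the date and tokens and sort them so the articles are in publication order.

The from_combination_to_token function then extracts everything except the date.

A lambda would also work, but the expression gets long and hard to read, so I wrote a function instead.

The word_to_sentence function then joins the tokens of each list into a single sentence.

Finally, the date is attached again.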
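The helper functions in nyt_func are not shown, so this is only a guess at the overall flow on invented data: cut the timestamp to a day, sort by date, and join the tokens into a sentence. Keeping each date next to its tokens while sorting means the two cannot drift out of alignment.

```python
# Invented (timestamp, tokens) pairs in collection order.
raw = [
    ["2015-03-03T08:30:00+0000", ["tech", "stocks", "slip"]],
    ["2015-03-02T10:00:00+0000", ["markets", "rally"]],
]

# Keep only the YYYY-MM-DD part of each timestamp (the first 10 characters).
records = [[stamp[:10], tokens] for stamp, tokens in raw]

# Sort by date; ISO-formatted dates sort correctly as strings.
records.sort(key=lambda record: record[0])

# Join each token list into one sentence, with its date kept alongside.
sentences = [[date, " ".join(tokens)] for date, tokens in records]
print(sentences)
```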

%%time
analyzer = SentimentIntensityAnalyzer()
nyt_headline_sentence = [nyt_headline_sentence[i] + [analyzer.polarity_scores(nyt_headline_sentence[i][1])["compound"]] for i in range(len(nyt_headline_sentence))]

nyt_score = []
for i in range(len(nyt_headline_sentence)):
    if nyt_headline_sentence[i][1] != "":
        score = nyt_headline_sentence[i] + [TextBlob(nyt_headline_sentence[i][1]).sentences[0].sentiment.polarity]
    else:
        score = nyt_headline_sentence[i] + [0.0]
    nyt_score.append(score)
    
    
nyt_score = pd.DataFrame(nyt_score)
nyt_score.columns = ["Date","Headline", "N_Score", "B_Score"]
nyt_score = nyt_score[["Date", "N_Score", "B_Score"]]
nyt_score.to_csv("nyt_score.csv", index = False)

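NLTK's sentiment analysis (VADER) took about 15 minutes to run, and only the compound value was extracted. It is a value between -1 and 1 that rates the text as positive, neutral, or negative.

A score computed with TextBlob is added as well.

Originally, the headline should be scored with NLTK and the article body with TextBlob, but retrieving NYT article bodies costs money, so the headline was scored with both NLTK and TextBlob.

The result is converted to a DataFrame, the column names are set, and the needed columns are selected and saved as a csv file.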
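The post feeds the raw compound score into the model, but to show how a value between -1 and 1 maps to positive, neutral, or negative, here is a small sketch. The cutoffs of 0.05 and -0.05 are the thresholds commonly cited for VADER's compound score. The function label_compound is a hypothetical helper and is not used anywhere in the analysis.

```python
def label_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Invented scores for illustration.
for score in (0.62, 0.0, -0.31):
    print(score, label_compound(score))
```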

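Third, create the analysis file.

It uses a function with the same structure as the train_test_result shown earlier, but because the data is different, the function is written anew. I will skip the explanation here (a post explaining train_test_result will follow later).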

import json
import numpy as np
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import regex as re
from nltk.corpus import stopwords

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import datetime

import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

import random
import os

from sklearn.metrics import mean_squared_error 
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.models import Sequential
from keras import optimizers
from keras import backend as K

import sys
sys.path.append(os.path.dirname(os.path.abspath(os.path.dirname("__file__"))))
import train_test as tt

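Load the required packages.

As shown in the Stock Price Prediction (1) post, set the initial values and create the functions for the evaluation metrics.

Then, as mentioned above, create a new train_test_result function.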

dir = "nyt_score.csv"
stock = "AMZN"
variable = ["Open","High","Low","Volume","N_Score", "B_Score"]
window_size = 1
start_date = "2015-01-01"
end_date = "2020-08-13"

scaler_x, scaler_y, x, y, x_t, y_t = train_test_result(dir, stock, variable, window_size, start_date, end_date)

First, use the same period, the same stock, and the same variables as the paper.

lstm_1_model = Sequential()
lstm_1_model.add(LSTM(50, input_shape = (x.shape[1],x.shape[2]),return_sequences=True))
lstm_1_model.add((LSTM(50)))
lstm_1_model.add(Dense(1,activation="tanh"))
lstm_1_model.compile(optimizer="adam", loss = 'mse' , metrics=["mae"])
history = lstm_1_model.fit(x, y, epochs=40, batch_size=128, validation_split=0.2)
    
lstm_1_model.evaluate(x_t, y_t)

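According to the paper, the model is an LSTM with window size = 1, two LSTM layers of 50 units each, batch size 128, and 40 epochs.

Everything else was set arbitrarily.

The paper reports an MSE of 0.001 and an MAE of 0.026.

With everything the same except that article bodies were not used, the result without inverting the normalization was an MSE of 0.002 and an MAE of 0.023. The difference from the paper is small.

After inverting the normalization, however, the MSE is 39.772 and the MAE is 3.414. If the paper's results were computed after inverting the normalization, the difference is considerable.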
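Why the two sets of numbers differ so much can be sketched with a toy example. Assuming the target was scaled with min-max normalization (MinMaxScaler is imported above), undoing the scaling multiplies every error by the price span (max minus min), so MAE grows by the span and MSE by the span squared. The prices below are invented and the helper functions are mine. The two pairs of figures reported above are roughly consistent with this relation (both imply a span of about 140 to 150), though the reported values are rounded.

```python
import math


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented prices and predictions in original units.
actual = [100.0, 120.0, 150.0, 200.0]
predicted = [105.0, 118.0, 155.0, 190.0]

# Min-max scaling fitted on the actual values, applied to both series.
lo, hi = min(actual), max(actual)
span = hi - lo
scaled_actual = [(a - lo) / span for a in actual]
scaled_predicted = [(p - lo) / span for p in predicted]

mse_scaled, mae_scaled = mse(scaled_actual, scaled_predicted), mae(scaled_actual, scaled_predicted)
mse_price, mae_price = mse(actual, predicted), mae(actual, predicted)

# Errors in original units equal the scaled errors times span (MAE) or span squared (MSE).
print(math.isclose(mse_price, mse_scaled * span ** 2))
print(math.isclose(mae_price, mae_scaled * span))
```

This is why a comparison with the paper only makes sense once it is clear whether the paper's metrics were computed on scaled or unscaled prices.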

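Using the same model, I also predicted the Dow Jones closing price with two variable sets: the variables used for the paper reproduction, and close plus the sentiment score of NBC news. The latter was slightly better. That is, the model using NBC news and the closing price outperformed the model using NYT news and the price data without the closing price.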