Sequence Prediction with CNN-LSTM


AI Studio · published 2022-06-03 00:02:51

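I. Stock Price Prediction
1. Introduction to Tushare
Tushare is a free, open-source Python package that provides an interface to financial data. It covers the pipeline from collecting stock and other financial data, through cleaning and processing, to storage. It gives financial analysts fast access to tidy, varied data that is easy to analyze, which greatly reduces the effort spent on data acquisition and lets them focus on researching and implementing strategies and models.

Tushare stores and processes data with the Python library Pandas, which makes data handling efficient and the output well formatted. Its documentation is comprehensive: usage, input parameters, and output parameters are all described clearly, and every function comes with an example, which makes it friendly to beginners. The data coverage is broad and includes stocks, funds, bonds, futures, and overseas stock information. Notably, it also provides Chinese macroeconomic data, which helps in judging the overall direction of the stock market.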
In [ ]

# Install tushare
!pip install tushare
In [ ]
import tushare as ts
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import paddle
import paddle.nn as nn
from sklearn.preprocessing import MinMaxScaler
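2. Fetching the data
We first fetch stock data through Tushare, using daily bars as the example. The code requests stock code 000001, which is Ping An Bank. The columns used are the open, high, close, and low prices plus the trading volume, and the result is saved locally. The code is as follows.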

In [ ]

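# Fetch about three years of daily K-line data for stock code 000001
# Note: get_hist_data belongs to Tushare's legacy interface and may be unavailable in newer releases
data = ts.get_hist_data('000001')
data.index = pd.to_datetime(data.index)
data = data[['open', 'high', 'close', 'low', 'volume']]

# The returned rows are in reverse chronological order, so flip them into ascending date order
data = data[::-1]
data.to_csv('./data.csv')

print(data)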

The resulting data looks like this:

date          open   high  close    low      volume
2019-11-27   15.64  15.64  15.47  15.39   476170.00
2019-11-28   15.47  15.54  15.49  15.44   378016.53
2019-11-29   15.54  15.55  15.29  15.18   653866.94
2019-12-02   15.35  15.43  15.36  15.23   553877.38
2019-12-03   15.30  15.46  15.45  15.21   455418.03
...            ...    ...    ...    ...         ...
2022-05-23   15.07  15.07  14.83  14.76   647725.94
2022-05-24   14.87  14.87  14.40  14.40   933557.06
2022-05-25   14.43  14.49  14.39  14.30   603228.25
2022-05-26   14.41  14.43  14.19  14.07  1011317.62
2022-05-27   14.29  14.35  14.18  14.11   723066.75

[606 rows x 5 columns]
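3. Data preprocessing
Next the data is preprocessed: it is cut into sliding windows and split into a training set and a test set. We start by defining a windowing function.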

In [ ]

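# Sliding-window split
def split_windows(data, size):
    X = []
    Y = []
    # X holds the input windows, Y holds the labels.
    # Slide a window of length `size` with stride 1. The label of each window is the
    # close value (column 2) of the time step right after the window.
    for i in range(len(data) - size):
        X.append(data[i:i+size, :])
        Y.append(data[i+size, 2])
    return np.array(X), np.array(Y)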
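To make the windowing concrete, here is a minimal sketch on toy data. The array `toy` and its values are made up for illustration; the function body repeats the logic above so that the snippet runs on its own.

```python
import numpy as np

def split_windows(data, size):
    """Same logic as the function above, repeated so the snippet is self-contained."""
    X, Y = [], []
    for i in range(len(data) - size):
        X.append(data[i:i + size, :])   # rows i .. i+size-1 form the input window
        Y.append(data[i + size, 2])     # column 2 of the following row is the label
    return np.array(X), np.array(Y)

# Hypothetical toy data: 10 time steps, 5 features, values 0..49
toy = np.arange(50, dtype=float).reshape(10, 5)
toy_X, toy_Y = split_windows(toy, size=3)

print(toy_X.shape, toy_Y.shape)   # 10 - 3 = 7 windows of shape (3, 5)
print(toy_X[0])                   # rows 0, 1, 2
print(toy_Y[0])                   # column 2 of row 3
```

A series of length n therefore yields n - size windows, and the label always lies one step past the end of its window.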
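4. Splitting into training and test sets
The first 500 rows are used for training and the remaining rows for testing. Concretely, the open, high, close, low, and volume columns are used to predict the closing price at the next time step.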
In [ ]
df = pd.read_csv('./data.csv', usecols=['open', 'high', 'close', 'low', 'volume'])
all_data = df.values
train_len = 500
train_data = all_data[:train_len, :]
test_data = all_data[train_len:, :]

plt.figure(figsize=(12, 8))
# Visualize the closing price (column 2) of both splits
plt.plot(np.arange(train_data.shape[0]), train_data[:, 2], label='train data')
plt.plot(np.arange(train_data.shape[0], train_data.shape[0] + test_data.shape[0]), test_data[:, 2], label='test data')
plt.legend()
[Figure: closing price over time, training segment followed by test segment]
5. Normalization
MinMaxScaler() rescales each feature to the range [0, 1] using that feature's minimum and maximum.
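As a rough illustration of what min-max scaling computes, the sketch below applies x' = (x - min) / (max - min) by hand with NumPy. The numbers are invented, and the helper `min_max` is a hypothetical stand-in for the scaler, not its implementation. It also shows why test values can fall outside [0, 1] when the statistics come from the training split only.

```python
import numpy as np

# Hypothetical toy data: 2 features, 4 training rows and 2 test rows
train = np.array([[10.0, 100.0],
                  [12.0, 300.0],
                  [14.0, 200.0],
                  [20.0, 500.0]])
test = np.array([[15.0, 400.0],
                 [22.0,  50.0]])

# Statistics are taken from the training split only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def min_max(x):
    """Scale each column with the training minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

scaled_train = min_max(train)
scaled_test = min_max(test)

print(scaled_train.min(axis=0), scaled_train.max(axis=0))  # training columns span 0 to 1
print(scaled_test)  # the second test row lies outside the training range
```

Values above 1 or below 0 in the test set are expected here: they simply mark observations beyond the range seen during training.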
In [ ]

# Normalization
scaler = MinMaxScaler()
scaled_train_data = scaler.fit_transform(train_data)
# Scale the test set with the training set's min and max so both splits share the same scaling
scaled_test_data = scaler.transform(test_data)

# Build sliding windows for the training and test sets
window_size = 7
train_X, train_Y = split_windows(scaled_train_data, size=window_size)
test_X, test_Y = split_windows(scaled_test_data, size=window_size)
print('train shape', train_X.shape, train_Y.shape)
print('test shape', test_X.shape, test_Y.shape)
train shape (493, 7, 5) (493,)
test shape (99, 7, 5) (99,)
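The shapes of the training and test sets are shown above. Here 493 is the number of training windows (500 rows minus the window size of 7), 7 is the window size, and 5 is the number of features, so each sample is a 7×5 array that can be treated like a 2D image.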
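6. Building the model
First come the input and reshape steps, which convert the input into the required layout. Each sample becomes a 7×5×1 2D image, where 7 is the window length (time steps), 5 is the number of features, and 1 is the number of channels.

Next, a 2D Conv layer performs the convolution with 64 filters. Padding is set to same so that the feature map keeps its size, and the activation function is ReLU.

This is followed by a max-pooling layer and a Dropout layer that guards against overfitting. With stride 1 and same padding, as configured in the code, the pooling layer leaves the feature-map size unchanged instead of downsampling it.

Two LSTM layers are then stacked to model the sequence along the time-step dimension.

Finally, fully connected layers output the predicted value for the next time step.

The loss function is mean squared error and the optimizer is Adam.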
In [ ]
window_size = 7
fea_num = 5
batch_size = 32
class CNN_LSTM(nn.Layer):
    def __init__(self, window_size, fea_num):
        super().__init__()
        self.window_size = window_size
        self.fea_num = fea_num
        # Conv block: 1 input channel -> 64 feature maps, spatial size preserved
        self.conv1 = nn.Conv2D(in_channels=1, out_channels=64, stride=1, kernel_size=3, padding='same')
        self.relu1 = nn.ReLU()
        self.pool = nn.MaxPool2D(kernel_size=2, stride=1, padding='same')
        self.dropout = nn.Dropout2D(0.3)

        # Two stacked LSTMs followed by a small fully connected head
        self.lstm1 = nn.LSTM(input_size=64*fea_num, hidden_size=128, num_layers=1, time_major=False)
        self.lstm2 = nn.LSTM(input_size=128, hidden_size=64, num_layers=1, time_major=False)
        self.fc = nn.Linear(in_features=64, out_features=32)
        self.relu2 = nn.ReLU()
        self.head = nn.Linear(in_features=32, out_features=1)

    def forward(self, x):
        # [batch, window_size, fea_num] -> [batch, 1, window_size, fea_num]
        x = x.reshape([x.shape[0], 1, self.window_size, self.fea_num])
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool(x)
        x = self.dropout(x)

        # Flatten to [batch, window_size, 64 * fea_num] for the LSTM layers
        x = x.reshape([x.shape[0], self.window_size, -1])
        x, (h, c) = self.lstm1(x)
        x, (h, c) = self.lstm2(x)
        x = x[:, -1, :]  # keep only the last time step's output of the final LSTM
        x = self.fc(x)
        x = self.relu2(x)
        x = self.head(x)

        return x
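One detail of the forward pass is worth a closer look: the convolution output has layout [batch, channels, time, features], and it is reshaped directly to [batch, time, -1]. The NumPy sketch below, on a made-up tiny tensor, compares that plain reshape with an alternative that first moves the time axis ahead of the channel axis. This is an illustration of the memory-layout effect, not the author's method; in Paddle the corresponding call would presumably be `x.transpose([0, 2, 1, 3])` before the reshape.

```python
import numpy as np

# Hypothetical tiny feature map: batch N=1, channels C=2, time steps T=3, features F=2
N, C, T, F = 1, 2, 3, 2
feat = np.arange(N * C * T * F).reshape(N, C, T, F)  # value = c*6 + t*2 + f

# Plain reshape: memory order is kept, so each row is filled channel by channel
# and ends up holding values from more than one time step
plain = feat.reshape(N, T, -1)

# Alternative: put the time axis before the channel axis first, so that
# row t holds all channels and features of time step t only
per_step = feat.transpose(0, 2, 1, 3).reshape(N, T, -1)

print(plain[0, 0])     # channel 0 at time steps 0 and 1
print(per_step[0, 0])  # channels 0 and 1 at time step 0
```

Both variants feed tensors of the same shape to the LSTM, so the code runs either way; they differ only in which values each "time step" contains, and switching would change the trained results reported here.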

In [ ]
model = CNN_LSTM(window_size, fea_num)
paddle.summary(model, (99,7, 5))

Layer (type)    Input Shape          Output Shape                                     Param #

Conv2D-8        [[99, 1, 7, 5]]      [99, 64, 7, 5]                                   640
ReLU-15         [[99, 64, 7, 5]]     [99, 64, 7, 5]                                   0
MaxPool2D-8     [[99, 64, 7, 5]]     [99, 64, 7, 5]                                   0
Dropout2D-8     [[99, 64, 7, 5]]     [99, 64, 7, 5]                                   0
LSTM-15         [[99, 7, 320]]       [[99, 7, 128], [[1, 99, 128], [1, 99, 128]]]     230,400
LSTM-16         [[99, 7, 128]]       [[99, 7, 64], [[1, 99, 64], [1, 99, 64]]]        49,664
Linear-15       [[99, 64]]           [99, 32]                                         2,080
ReLU-16         [[99, 32]]           [99, 32]                                         0
Linear-16       [[99, 32]]           [99, 1]                                          33

Total params: 282,817
Trainable params: 282,817
Non-trainable params: 0

Input size (MB): 0.01
Forward/backward pass size (MB): 8.12
Params size (MB): 1.08
Estimated Total Size (MB): 9.21

{'total_params': 282817, 'trainable_params': 282817}
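The parameter counts in the summary can be reproduced by hand. The sketch below uses the usual formulas for convolution and linear layers; for the LSTM it assumes four gates with two bias vectors each, which is the layout that matches the printed numbers. The helper names are made up for this example.

```python
def conv2d_params(in_ch, out_ch, k):
    # One k x k kernel per (input, output) channel pair, plus one bias per output channel
    return in_ch * out_ch * k * k + out_ch

def lstm_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors (assumed layout)
    return 4 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)

def linear_params(in_features, out_features):
    # Weight matrix plus bias vector
    return in_features * out_features + out_features

counts = {
    "conv1": conv2d_params(1, 64, 3),
    "lstm1": lstm_params(64 * 5, 128),
    "lstm2": lstm_params(128, 64),
    "fc": linear_params(64, 32),
    "head": linear_params(32, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:6s} {n:>8,d}")
print(f"total  {total:>8,d}")
```

The first LSTM dominates the total because its input is the flattened 64 × 5 = 320 convolution features per time step.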
In [ ]

# Define hyperparameters
base_lr = 0.005
BATCH_SIZE = 32
EPOCH = 200
lr_schedual = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=base_lr, T_max=EPOCH, verbose=True)
loss_fn = nn.MSELoss()
metric = paddle.metric.Accuracy()  # a classification metric; it is not used in the training loop
opt = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=lr_schedual, beta1=0.9, beta2=0.999)
Epoch 0: CosineAnnealingDecay set learning rate to 0.005.
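For intuition about the learning-rate schedule, here is a small sketch of the textbook closed-form cosine annealing curve. It assumes a minimum learning rate of 0 and one scheduler step per epoch; the function `cosine_annealing_lr` is a hypothetical helper, not Paddle's internal code.

```python
import math

def cosine_annealing_lr(epoch, base_lr=0.005, t_max=200, eta_min=0.0):
    """Closed-form cosine annealing: starts at base_lr and reaches eta_min at t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Learning rate at a few points of the 200-epoch run
for epoch in (0, 50, 100, 150, 200):
    print(epoch, round(cosine_annealing_lr(epoch), 6))
```

The rate falls slowly at first, fastest around the midpoint (where it equals half the base rate), and flattens out again near the end of training.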
In [ ]
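def process(data, bs):
    # Cut `data` into consecutive batches of size `bs`; the last batch may be smaller
    batches = [data[i:i+bs].tolist() for i in range(0, len(data), bs)]
    # Batches can differ in length, so store them in a 1-D object array
    # (newer NumPy versions raise an error for ragged arrays without dtype=object)
    tmp = np.empty(len(batches), dtype=object)
    for j, batch in enumerate(batches):
        tmp[j] = batch
    return tmp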
In [ ]

# Split the training set into batches

train_X = process(train_X, 32)
train_Y = process(train_Y, 32)
print(train_X.shape, train_Y.shape)
(16,) (16,)
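The `(16,) (16,)` shape printed for the batched training set follows from simple arithmetic on the 493 training windows and the batch size of 32. A minimal sketch, using only numbers that appear in this section:

```python
import math

n_samples, batch_size = 493, 32

# Number of batches when the last, smaller batch is kept
n_batches = math.ceil(n_samples / batch_size)

# Size of every batch, computed the same way the slicing loop would produce them
sizes = [min(batch_size, n_samples - start) for start in range(0, n_samples, batch_size)]

print(n_batches)   # 15 full batches plus one partial batch
print(sizes[-1])   # size of the final batch
```

Because the final batch holds 13 samples instead of 32, the batches cannot be stacked into one regular array, which is why only the outer dimension shows up in the shape.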

7. Training the model
In [ ]

# Train the model
for epoch in range(EPOCH):
    model.train()
    loss_train = 0
    for batch_id, data in enumerate(train_X):
        label = train_Y[batch_id]
        data = paddle.to_tensor(data, dtype='float32')
        label = paddle.to_tensor(label, dtype='float32')
        label = label.reshape([label.shape[0], 1])
        y = model(data)

        loss = loss_fn(y, label)
        opt.clear_grad()
        loss.backward()
        opt.step()
        loss_train += loss.item()  # running sum of the per-batch MSE values
    print("[TRAIN] ========epoch : {},  loss: {:.4f}==========".format(epoch+1, loss_train))
    lr_schedual.step()

    # loss_eval = 0
    # model.eval()
    # for batch_id, data in enumerate(test_X):
    #     label = test_Y[batch_id]
    #     data = paddle.to_tensor(data, dtype='float32')
    #     label = paddle.to_tensor(label, dtype='float32')
    #     label = label.reshape([label.shape[0],1])
    #     y = model(data)

    #     loss = loss_fn(y, label)
    #     loss_eval += loss.item()
    # print("[EVAL] ========epoch : {},  loss: {:.4f}==========\n".format(epoch+1, loss_eval))

# Save the model parameters and the scheduler state
paddle.save(model.state_dict(), 'work/cnn_lstm_ep200_lr0.005.params')
paddle.save(lr_schedual.state_dict(), 'work/cnn_lstm_ep200_lr0.005.pdopts')
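The training log looks like this:

[TRAIN] ========epoch : 1,  loss: 5.8027==========
[TRAIN] ========epoch : 2,  loss: 1.6156==========
[TRAIN] ========epoch : 3,  loss: 2.6685==========
[TRAIN] ========epoch : 4,  loss: 0.4328==========
...
[TRAIN] ========epoch : 198,  loss: 0.0188==========
[TRAIN] ========epoch : 199,  loss: 0.0191==========
[TRAIN] ========epoch : 200,  loss: 0.0183==========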

8. Evaluating the model
In [ ]

# Load the trained model
model = CNN_LSTM(window_size, fea_num)
model_dict = paddle.load('work/cnn_lstm_ep200_lr0.005.params')  # the file written by the training cell
model.load_dict(model_dict)
model.eval()  # switch Dropout to inference mode

test_X = paddle.to_tensor(test_X, dtype='float32')
prediction = model(test_X)
prediction = prediction.cpu().numpy()
prediction = prediction.reshape(prediction.shape[0], )

# Undo the normalization for the close column (index 2)
scaled_prediction = prediction * (scaler.data_max_[2] - scaler.data_min_[2]) + scaler.data_min_[2]
scaled_true = test_Y * (scaler.data_max_[2] - scaler.data_min_[2]) + scaler.data_min_[2]

# Plot true values against predictions
plt.plot(range(len(scaled_true)), scaled_true, label='true')
plt.plot(range(len(scaled_prediction)), scaled_prediction, label='prediction', marker='*')
plt.legend()

from sklearn.metrics import mean_squared_error
print('RMSE', np.sqrt(mean_squared_error(scaled_prediction, scaled_true)))
RMSE 0.4623269038898004
[Figure: true and predicted closing prices on the test set]
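The de-normalization step above inverts the min-max formula for one column, and that has a simple consequence for the error metric: the RMSE in price units equals the RMSE in normalized units multiplied by (max - min). The sketch below checks this on made-up numbers; `col_min`, `col_max`, and both arrays are hypothetical.

```python
import numpy as np

# Hypothetical training-set minimum and maximum of the target column
col_min, col_max = 10.0, 20.0

# Hypothetical normalized targets and predictions
true_scaled = np.array([0.50, 0.60, 0.55])
pred_scaled = np.array([0.52, 0.57, 0.55])

def denormalize(x):
    """Invert x' = (x - min) / (max - min) for a single column."""
    return x * (col_max - col_min) + col_min

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

true_price = denormalize(true_scaled)
pred_price = denormalize(pred_scaled)

rmse_scaled = rmse(pred_scaled, true_scaled)
rmse_price = rmse(pred_price, true_price)

print(true_price)
print(rmse_scaled, rmse_price)  # the second value is (col_max - col_min) times the first
```

Reporting the error after de-normalization, as done here, makes it directly comparable to the price scale.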
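9. Summary
RMSE is used as the measure of prediction accuracy, and the result is visualized in the figure above. The blue line is the true closing price and the yellow line is the model's prediction. The final RMSE is about 0.46 in price units, so the prediction tracks the true series fairly closely.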
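II. Wind Condition Prediction for a Wind Farm
1. Data preparation
Next we predict wind conditions at a wind farm. The dataset contains all records from 2017-01-18, sampled once every 30 s. Each record has four features: the active power on the grid side of the converter, the ambient temperature, the wind direction, and the wind speed.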
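A quick back-of-the-envelope check of the dataset size, assuming a full day with no missing records (an assumption, since the file itself is not shown). The training length of 2000 and the window size of 30 are the values used later in this part.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_INTERVAL_S = 30

# One record every 30 s over a full day (assumes no gaps in the recording)
records_per_day = SECONDS_PER_DAY // SAMPLE_INTERVAL_S

train_len = 2000
test_len = records_per_day - train_len

# A window of 30 steps covers this many minutes of history
window_size = 30
window_minutes = window_size * SAMPLE_INTERVAL_S / 60

# Number of sliding windows per split: series length minus window size
train_windows = train_len - window_size
test_windows = test_len - window_size

print(records_per_day, test_len, window_minutes)
print(train_windows, test_windows)
```

Under that assumption each prediction is based on the previous 15 minutes of measurements.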
In [ ]
df = pd.read_csv('./2017-01-18.csv', usecols=['watt', 'temperature', 'windsock', 'windspeed'])
all_data = df.values
train_len = 2000
train_data = all_data[:train_len, :]
test_data = all_data[train_len:, :]

plt.figure(figsize=(15, 8))
# Visualize column 2, the prediction target
# (windsock, assuming the CSV stores the columns in the order listed above)
plt.plot(np.arange(train_data.shape[0]), train_data[:, 2], label='train data')
plt.plot(np.arange(train_data.shape[0], train_data.shape[0] + test_data.shape[0]), test_data[:, 2], label='test data')
plt.legend()
[Figure: column 2 over time, training segment followed by test segment]
2. Normalization
The same preprocessing steps as in Part I are applied.
In [ ]

# Normalization
scaler = MinMaxScaler()
scaled_train_data = scaler.fit_transform(train_data)
# Scale the test set with the training set's min and max so both splits share the same scaling
scaled_test_data = scaler.transform(test_data)

# Build sliding windows for the training and test sets
window_size = 30
# split_windows is defined above
train_X, train_Y = split_windows(scaled_train_data, size=window_size)
test_X, test_Y = split_windows(scaled_test_data, size=window_size)
print('train shape', train_X.shape, train_Y.shape)
print('test shape', test_X.shape, test_Y.shape)
train shape (1970, 30, 4) (1970,)
test shape (850, 30, 4) (850,)
In [ ]

# process is defined above

train_X = process(train_X, 32)
train_Y = process(train_Y, 32)
print(train_X.shape, train_Y.shape)
(62,) (62,)
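3. Building the model
First, an encoder (a linear layer) expands the 4 input features to 30, which gives the convolution layer a wider input to work on.

Next, a 2D Conv layer performs the convolution with 64 filters. Padding is set to same so that the feature map keeps its size, and the activation function is ReLU.

This is followed by a max-pooling layer and a Dropout layer that guards against overfitting. With stride 1 and same padding, as configured in the code, the pooling layer leaves the feature-map size unchanged instead of downsampling it.

Three LSTM layers are then stacked to model the sequence along the time-step dimension.

Finally, fully connected layers output the predicted value for the next time step.

The loss function is mean squared error and the optimizer is Adam.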
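The feature-expansion step described first can be pictured as one matrix multiplication applied to the last axis of the input. The NumPy sketch below uses random, made-up weights (`W`, `b`) and a small batch purely to show the shapes; it stands in for the linear layer and is not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, window_size, fea_num, out_fea = 2, 30, 4, 30

# Input after the reshape in forward(): [batch, 1, window_size, fea_num]
x = rng.normal(size=(batch, 1, window_size, fea_num))

# Hypothetical projection weights: every time step is mapped from 4 to 30 features
W = rng.normal(size=(fea_num, out_fea))
b = np.zeros(out_fea)

# Matrix multiplication acts on the last axis; leading axes are left untouched
projected = x @ W + b

n_params = W.size + b.size

print(projected.shape)  # each sample is now a 30 x 30 single-channel "image"
print(n_params)         # 4 * 30 weights plus 30 biases
```

After the projection each sample is square (30 time steps by 30 features), and the same weights are shared across all time steps.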
In [ ]
window_size = 30
fea_num = 4
out_fea = 30  # number of features after the linear projection (read as a global by the class)
batch_size = 32
class CNN_LSTM(nn.Layer):
    def __init__(self, window_size, fea_num):
        super().__init__()
        self.window_size = window_size
        self.fea_num = fea_num
        # Linear encoder: expand fea_num input features to out_fea
        self.proj = nn.Linear(in_features=fea_num, out_features=out_fea)
        self.conv1 = nn.Conv2D(in_channels=1, out_channels=64, stride=1, kernel_size=3, padding='same')
        self.relu1 = nn.ReLU()
        self.pool = nn.MaxPool2D(kernel_size=2, stride=1, padding='same')
        self.dropout = nn.Dropout2D(0.3)

        # Three stacked LSTMs followed by a small fully connected head
        self.lstm1 = nn.LSTM(input_size=64*out_fea, hidden_size=256, num_layers=1, time_major=False)
        self.lstm2 = nn.LSTM(input_size=256, hidden_size=128, num_layers=1, time_major=False)
        self.lstm3 = nn.LSTM(input_size=128, hidden_size=64, num_layers=1, time_major=False)
        self.fc = nn.Linear(in_features=64, out_features=32)
        self.relu2 = nn.ReLU()
        self.head = nn.Linear(in_features=32, out_features=1)

    def forward(self, x):
        # [batch, window_size, fea_num] -> [batch, 1, window_size, fea_num]
        x = x.reshape([x.shape[0], 1, self.window_size, self.fea_num])
        x = self.proj(x)
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool(x)
        x = self.dropout(x)

        # Flatten to [batch, window_size, 64 * out_fea] for the LSTM layers
        x = x.reshape([x.shape[0], self.window_size, -1])
        x, (h, c) = self.lstm1(x)
        x, (h, c) = self.lstm2(x)
        x, (h, c) = self.lstm3(x)
        x = x[:, -1, :]  # keep only the last time step's output of the final LSTM
        x = self.fc(x)
        x = self.relu2(x)
        x = self.head(x)

        return x

In [15]
model = CNN_LSTM(window_size, fea_num)
paddle.summary(model, (99, 30 , 4))

Layer (type)    Input Shape           Output Shape                                       Param #

Linear-7        [[99, 1, 30, 4]]      [99, 1, 30, 30]                                    150
Conv2D-3        [[99, 1, 30, 30]]     [99, 64, 30, 30]                                   640
ReLU-5          [[99, 64, 30, 30]]    [99, 64, 30, 30]                                   0
MaxPool2D-3     [[99, 64, 30, 30]]    [99, 64, 30, 30]                                   0
Dropout2D-3     [[99, 64, 30, 30]]    [99, 64, 30, 30]                                   0
LSTM-7          [[99, 30, 1920]]      [[99, 30, 256], [[1, 99, 256], [1, 99, 256]]]      2,230,272
LSTM-8          [[99, 30, 256]]       [[99, 30, 128], [[1, 99, 128], [1, 99, 128]]]      197,632
LSTM-9          [[99, 30, 128]]       [[99, 30, 64], [[1, 99, 64], [1, 99, 64]]]         49,664
Linear-8        [[99, 64]]            [99, 32]                                           2,080
ReLU-6          [[99, 32]]            [99, 32]                                           0
Linear-9        [[99, 32]]            [99, 1]                                            33

Total params: 2,480,471
Trainable params: 2,480,471
Non-trainable params: 0

Input size (MB): 0.05
Forward/backward pass size (MB): 185.58
Params size (MB): 9.46
Estimated Total Size (MB): 195.09

# Define hyperparameters
base_lr = 0.005
BATCH_SIZE = 32
EPOCH = 120
lr_schedual = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=base_lr, T_max=EPOCH, verbose=True)
loss_fn = nn.MSELoss()
metric = paddle.metric.Accuracy()  # a classification metric; it is not used in the training loop
opt = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=lr_schedual, beta1=0.9, beta2=0.999)
Epoch 0: CosineAnnealingDecay set learning rate to 0.005.
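4. Training the model

Training uses a strategy driven by historical gradients: a weighted sum of the current gradient and past gradients serves as the gradient for each update. In the code shown here, this corresponds to the Adam optimizer, whose beta1 = 0.9 weights the running average of past gradients.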
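The idea of mixing the current gradient with past gradients can be sketched with a scalar version of the standard Adam update. This is a textbook formulation written for illustration, with a made-up objective f(w) = (w - 3)^2; it is not Paddle's implementation, and `adam_step` is a hypothetical helper.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # weighted sum of past and current gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
first_w = None
for t in range(1, 3001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t)
    if t == 1:
        first_w = w  # the first step moves by roughly lr, whatever the gradient's size

print(first_w)
print(w)  # close to the minimizer at 3
```

With beta1 = 0.9 the update direction is an exponentially weighted average of past and current gradients, which smooths out noisy per-batch gradients.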

In [17]

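# Train the model (same loop as in Part I)
for epoch in range(EPOCH):
    model.train()
    loss_train = 0
    for batch_id, data in enumerate(train_X):
        label = train_Y[batch_id]
        data = paddle.to_tensor(data, dtype='float32')
        label = paddle.to_tensor(label, dtype='float32')
        label = label.reshape([label.shape[0], 1])
        y = model(data)

        loss = loss_fn(y, label)
        opt.clear_grad()
        loss.backward()
        opt.step()
        loss_train += loss.item()  # running sum of the per-batch MSE values
    print("[TRAIN] ========epoch : {},  loss: {:.4f}==========".format(epoch+1, loss_train))
    lr_schedual.step()

    # loss_eval = 0
    # model.eval()
    # for batch_id, data in enumerate(test_X):
    #     label = test_Y[batch_id]
    #     data = paddle.to_tensor(data, dtype='float32')
    #     label = paddle.to_tensor(label, dtype='float32')
    #     label = label.reshape([label.shape[0],1])
    #     y = model(data)

    #     loss = loss_fn(y, label)
    #     loss_eval += loss.item()
    # print("[EVAL] ========epoch : {},  loss: {:.4f}==========\n".format(epoch+1, loss_eval))

# Save the model parameters and the scheduler state
paddle.save(model.state_dict(), 'wind/cnn_lstm_ep100_lr0.005.params')
paddle.save(lr_schedual.state_dict(), 'wind/cnn_lstm_ep100_lr0.005.pdopts')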
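The training log looks like this:
[TRAIN] ========epoch : 1,  loss: 2.6941==========
[TRAIN] ========epoch : 2,  loss: 1.7112==========
[TRAIN] ========epoch : 3,  loss: 1.8545==========
...
[TRAIN] ========epoch : 117,  loss: 0.3156==========
[TRAIN] ========epoch : 118,  loss: 0.3080==========
[TRAIN] ========epoch : 119,  loss: 0.3078==========
[TRAIN] ========epoch : 120,  loss: 0.3038==========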
In [19]

# Load the trained model
model = CNN_LSTM(window_size, fea_num)
model_dict = paddle.load('wind/cnn_lstm_ep100_lr0.005.params')
model.load_dict(model_dict)
model.eval()  # switch Dropout to inference mode

test_X = paddle.to_tensor(test_X, dtype='float32')
prediction = model(test_X)
prediction = prediction.cpu().numpy()
prediction = prediction.reshape(prediction.shape[0], )

# Undo the normalization for the target column (index 2)
scaled_prediction = prediction * (scaler.data_max_[2] - scaler.data_min_[2]) + scaler.data_min_[2]
scaled_true = test_Y * (scaler.data_max_[2] - scaler.data_min_[2]) + scaler.data_min_[2]

# Plot true values against predictions
plt.figure(figsize=(15, 8))
plt.plot(range(len(scaled_true)), scaled_true, label='true')
plt.plot(range(len(scaled_prediction)), scaled_prediction, label='prediction', marker='*')
plt.legend()

from sklearn.metrics import mean_squared_error
print('RMSE', np.sqrt(mean_squared_error(scaled_prediction, scaled_true)))
RMSE 0.01224376109865677
[Figure: true and predicted values of the target column on the test set]
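5. Summary
This was a first attempt at combining CNN and LSTM for time-series prediction, and it reaches reasonably good accuracy: an RMSE of about 0.46 on the stock closing price and about 0.012 on the wind data.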