Machine Learning Project 3: Predicting Human Calorie Expenditure with XGBoost

AI Studio · Published 2022-11-14 22:58:40

1. Project Background

1.1 What Is a Calorie

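Anyone who enjoys fitness and exercise, or who is trying to lose weight, will be familiar with this term.
The calorie (abbreviated cal) is defined as the amount of heat needed to raise the temperature of 1 gram of water by 1 °C at a pressure of 1 atmosphere. It is a unit of heat that is widely used in nutrition labeling and fitness guides; the international standard unit of energy is the joule.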
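As a small worked example of the unit relationship, the sketch below converts kilocalories to kilojoules. It assumes the thermochemical calorie (1 cal = 4.184 J) and that nutrition-style "Calories" are kilocalories; the article does not state which unit the dataset uses, and the function name is illustrative.

```python
# Conversion factors; the thermochemical calorie is an assumption made here.
JOULES_PER_CAL = 4.184   # 1 cal = 4.184 J
CAL_PER_KCAL = 1000      # 1 kcal (the food "Calorie") = 1000 cal


def kcal_to_kilojoules(kcal: float) -> float:
    """Convert kilocalories to kilojoules."""
    return kcal * CAL_PER_KCAL * JOULES_PER_CAL / 1000


print(kcal_to_kilojoules(231.0))
```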

1.2 Project Overview

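As the idea of healthy living takes hold, more and more fitness enthusiasts want to measure how many calories they burn during strength training so that they can adjust their diet afterwards. Traditional ways of measuring energy expenditure during exercise are cumbersome and require extra equipment. A convenient and effective way to estimate calorie expenditure and recognize movements during strength training is therefore needed: once a set of body measurements has been collected from the user, the energy expended can be predicted online in real time.
During everyday workouts, and strength training in particular, the body burns a large number of calories.
To support post-workout nutrition and meal planning, the goal is to give people, especially those who exercise regularly, a fast and accurate real-time prediction of their energy expenditure.
To that end, we built a system that predicts calorie expenditure. It uses the XGBoost regression algorithm and estimates a person's energy expenditure online in real time from seven inputs: gender, age, height, weight, exercise duration, heart rate, and body temperature.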

2. Import Dependencies

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Load the datasets
calories = pd.read_csv(r"work/calories.csv")
calories.head()

	User_ID	Calories
0	14733363	231.0
1	14861698	66.0
2	11179863	26.0
3	16180408	71.0
4	17771927	35.0

exercise = pd.read_csv("work/exercise.csv")
exercise.head()

	User_ID	Gender	Age	Height	Weight	Duration	Heart_Rate	Body_Temp
0	14733363	male	68	190.0	94.0	29.0	105.0	40.8
1	14861698	female	20	166.0	60.0	14.0	94.0	40.3
2	11179863	male	69	179.0	79.0	5.0	88.0	38.7
3	16180408	female	34	179.0	71.0	13.0	100.0	40.5
4	17771927	female	27	154.0	58.0	10.0	81.0	39.8

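# Merge the datasets (pd.concat aligns on the row index, so both files must list users in the same order)
df = pd.concat([exercise, calories.Calories], axis=1)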
df.head()

	User_ID	Gender	Age	Height	Weight	Duration	Heart_Rate	Body_Temp	Calories
0	14733363	male	68	190.0	94.0	29.0	105.0	40.8	231.0
1	14861698	female	20	166.0	60.0	14.0	94.0	40.3	66.0
2	11179863	male	69	179.0	79.0	5.0	88.0	38.7	26.0
3	16180408	female	34	179.0	71.0	13.0	100.0	40.5	71.0
4	17771927	female	27	154.0	58.0	10.0	81.0	39.8	35.0
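Because the concatenation above depends on both files having the same row order, a key-based join is a more defensive alternative. The sketch below uses small toy frames (not the real CSV files) whose rows are deliberately shuffled; the names `exercise_demo` and `calories_demo` are illustrative.

```python
import pandas as pd

# Toy stand-ins for exercise.csv and calories.csv; rows are in different orders.
exercise_demo = pd.DataFrame({
    "User_ID": [14733363, 14861698, 11179863],
    "Duration": [29.0, 14.0, 5.0],
})
calories_demo = pd.DataFrame({
    "User_ID": [11179863, 14733363, 14861698],
    "Calories": [26.0, 231.0, 66.0],
})

# Join on the key; validate="one_to_one" raises if a User_ID is duplicated on either side.
merged = exercise_demo.merge(
    calories_demo, on="User_ID", how="inner", validate="one_to_one"
)
print(merged)
```

With a join on `User_ID`, each calorie value stays attached to the right user even if one file is sorted differently.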

df.shape

(15000, 9)

3. Exploratory Data Analysis (EDA)

3.1 Data Description

df.describe()  # summary statistics for the numeric columns

	User_ID	Age	Height	Weight	Duration	Heart_Rate	Body_Temp	Calories
count	1.500000e+04	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000	15000.000000
mean	1.497736e+07	42.789800	174.465133	74.966867	15.530600	95.518533	40.025453	89.539533
std	2.872851e+06	16.980264	14.258114	15.035657	8.319203	9.583328	0.779230	62.456978
min	1.000116e+07	20.000000	123.000000	36.000000	1.000000	67.000000	37.100000	1.000000
25%	1.247419e+07	28.000000	164.000000	63.000000	8.000000	88.000000	39.600000	35.000000
50%	1.499728e+07	39.000000	175.000000	74.000000	16.000000	96.000000	40.200000	79.000000
75%	1.744928e+07	56.000000	185.000000	87.000000	23.000000	103.000000	40.600000	138.000000
max	1.999965e+07	79.000000	222.000000	132.000000	30.000000	128.000000	41.500000	314.000000

3.2 Check for Missing Values

df.isnull().sum()

User_ID       0
Gender        0
Age           0
Height        0
Weight        0
Duration      0
Heart_Rate    0
Body_Temp     0
Calories      0
dtype: int64

df.columns

Index(['User_ID', 'Gender', 'Age', 'Height', 'Weight', 'Duration',
       'Heart_Rate', 'Body_Temp', 'Calories'],
      dtype='object')

# Continuous variables
constant_features = [ 'Age', 'Height', 'Weight', 'Duration',
       'Heart_Rate', 'Body_Temp']

df.head()

	User_ID	Gender	Age	Height	Weight	Duration	Heart_Rate	Body_Temp	Calories
0	14733363	male	68	190.0	94.0	29.0	105.0	40.8	231.0
1	14861698	female	20	166.0	60.0	14.0	94.0	40.3	66.0
2	11179863	male	69	179.0	79.0	5.0	88.0	38.7	26.0
3	16180408	female	34	179.0	71.0	13.0	100.0	40.5	71.0
4	17771927	female	27	154.0	58.0	10.0	81.0	39.8	35.0

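3.3 Plot Probability Densities

Two plotting styles are used: seaborn's kdeplot in this section and seaborn's distribution plot in section 3.4, both laid out on a matplotlib subplot grid.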


def kde_plot_array(df):
    """
    Plot a grid of kernel density estimates, one panel per column.
    df: the DataFrame whose columns are plotted
    The figure is displayed with plt.show().
    """
    plt.figure(figsize=(24, 20))
    # Integer row count for a two-column grid; plt.subplot rejects floats
    n_rows = (len(df.columns) + 1) // 2
    for num, col in enumerate(df.columns):
        plt.subplot(n_rows, 2, num + 1)
        # fill= replaces the deprecated shade= argument (seaborn >= 0.11)
        sns.kdeplot(df[col], fill=True, label=col, alpha=0.7)
        plt.legend()
        plt.title(col)
    plt.show()

kde_plot_array(df[constant_features])

[Figure: kernel density plots of Age, Height, Weight, Duration, Heart_Rate, and Body_Temp]

3.4 Inspect Feature Distributions

sns.countplot(x=df['Gender'])  # the male and female counts are roughly equal

<matplotlib.axes._subplots.AxesSubplot at 0x7fa7213ee5d0>

[Figure: count plot of Gender]

def display(df):
    '''Plot the distribution of each column with seaborn (histogram plus KDE curve).'''
    plt.figure(figsize=(24, 20))
    # Integer row count for a two-column grid; plt.subplot rejects floats
    n_rows = (len(df.columns) + 1) // 2
    for num, col in enumerate(df.columns):
        plt.subplot(n_rows, 2, num + 1)
        # histplot(kde=True) replaces the deprecated distplot (seaborn >= 0.11)
        sns.histplot(df[col], kde=True, stat="density")
        plt.title(col)
    plt.show()

display(df[constant_features])

[Figure: distribution plots of Age, Height, Weight, Duration, Heart_Rate, and Body_Temp]

# Encode the categorical variable. LabelEncoder would also work; DataFrame.replace is used here because it is more convenient
df.replace({'Gender':{'male':0,"female":1}}, inplace = True)
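One way to make the encoding step fail loudly on unexpected labels is to use `Series.map` and check for missing values afterwards. This is a minimal sketch on a toy Series, not the article's DataFrame; `gender_demo` and `gender_codes` are illustrative names.

```python
import pandas as pd

gender_demo = pd.Series(["male", "female", "female", "male"], name="Gender")

# Same mapping as in the article: male -> 0, female -> 1.
gender_codes = {"male": 0, "female": 1}
encoded = gender_demo.map(gender_codes)

# Series.map yields NaN for labels missing from the mapping, so typos are easy to catch.
if encoded.isna().any():
    raise ValueError("unexpected Gender label")
encoded = encoded.astype(int)
print(encoded.tolist())
```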

3.5 Create Features and Labels

X = df.drop(['User_ID','Calories'],axis=1).values
y = df.Calories

print(X)

[[  0.   68.  190.  ...  29.  105.   40.8]
 [  1.   20.  166.  ...  14.   94.   40.3]
 [  0.   69.  179.  ...   5.   88.   38.7]
 ...
 [  1.   43.  159.  ...  16.   90.   40.1]
 [  0.   78.  193.  ...   2.   84.   38.3]
 [  0.   63.  173.  ...  18.   92.   40.5]]

3.6 Split the Dataset

X_train ,X_test ,y_train ,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

print(X_train.shape,X_test.shape)
print(y_train.shape,y_test.shape)

(12000, 7) (3000, 7)
(12000,) (3000,)

4. Model Training

model = XGBRegressor(random_state=42)  # XGBoost is the algorithm chosen for this project
model.fit(X_train, y_train)
X_preds = model.predict(X_train)  # predictions on the training set

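5. Model Prediction

Prediction simply calls the XGBoost model's predict function to obtain the predicted values; other regression models could be used in the same way.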

preds = model.predict(X_test)

# Inspect the predicted values
preds

array([127.823784, 226.00154 ,  38.66253 , ..., 144.3636  ,  22.767195,
        89.87375 ], dtype=float32)

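5.1 Visualize Predictions Against True Values

The predicted values lie very close to the true values, which indicates that the model is effective on the test set.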

plt.scatter(y_test,preds)
plt.xlabel('y_test')
plt.ylabel('preds')
plt.title('y_test VS preds')
plt.show()

[Figure: scatter plot of y_test versus preds]

5.2 Mean Absolute Error

mae = metrics.mean_absolute_error(y_test,preds)
mae

1.4807048829992613

5.3 Root Mean Squared Error

Rmse = np.sqrt(metrics.mean_squared_error(y_test,preds))
Rmse

2.12938076108955

5.4 R² Score

The R² score is very close to 1, so the model predicts well on the test set.

preds_R2_score = metrics.r2_score(y_test,preds)
preds_R2_score

0.9988455491362879
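To make the evaluation metrics concrete, the sketch below computes the mean absolute error, root mean squared error, and R² directly from their definitions with NumPy. The helper name `regression_metrics` and the toy values are illustrative and are not results from the article's model.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))              # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))       # penalizes large errors more
    ss_res = np.sum(residuals ** 2)               # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Toy values for illustration only.
mae_demo, rmse_demo, r2_demo = regression_metrics(
    [231.0, 66.0, 26.0, 71.0], [229.0, 67.0, 27.0, 73.0]
)
print(mae_demo, rmse_demo, r2_demo)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two suggests that a few samples have much larger errors than the rest.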

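6. Building the Prediction System

The user enters the corresponding data and the model predicts the calories burned from that input. The model could also be deployed on a device to build an energy expenditure prediction system.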

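# Feature order: Gender (0 = male, 1 = female), Age, Height, Weight, Duration, Heart_Rate, Body_Temp
input_data = (1, 20, 166.0, 60.0, 14.0, 94.0, 40.3)

# Convert to a NumPy array
input_data_as_numpy_array = np.asarray(input_data)

# Reshape into a 2-D array with a single row
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)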

prediction = model.predict(input_data_reshaped)
print(prediction)

print('Estimated calories burned: {}'.format(prediction[0]))

[64.68266]
Estimated calories burned: 64.68266296386719
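The manual steps above can be wrapped in a small helper so that callers pass named values instead of remembering the column order. This is a sketch under assumptions: `build_feature_row` and `predict_calories` are hypothetical helpers, and the model is any fitted object with a `predict` method.

```python
import numpy as np

# Same encoding as in the article: male -> 0, female -> 1.
GENDER_CODES = {"male": 0, "female": 1}


def build_feature_row(gender, age, height, weight, duration, heart_rate, body_temp):
    """Return a (1, 7) array in the column order used for training."""
    if gender not in GENDER_CODES:
        raise ValueError(f"gender must be one of {sorted(GENDER_CODES)}")
    row = [GENDER_CODES[gender], age, height, weight, duration, heart_rate, body_temp]
    return np.asarray(row, dtype=float).reshape(1, -1)


def predict_calories(model, **user):
    """Hypothetical helper: predict with any fitted model exposing predict()."""
    return float(model.predict(build_feature_row(**user))[0])


row = build_feature_row("female", 20, 166.0, 60.0, 14.0, 94.0, 40.3)
print(row.shape)
```

With the trained model from section 4, a call would look like `predict_calories(model, gender="female", age=20, height=166.0, weight=60.0, duration=14.0, heart_rate=94.0, body_temp=40.3)`.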

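Project Summary

This project uses only the XGBoost regression algorithm. Future work could try other regression algorithms or deep neural networks and keep tuning the model to improve prediction accuracy.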

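About the Author

A second-year graduate student at Wuhan University of Science and Technology. Research interests: computer vision, machine vision, and biomedical big data.

This article is a repost.
Original project link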
