Lung Cancer Risk Analysis and Prediction Based on Machine Learning

The data is balanced with imblearn and standardized with StandardScaler, and a random forest classification model is then trained and used for prediction.

AI Studio · Published 2023-04-29 23:07:01

This article originates from a featured project in the AI Studio community.

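I. Lung Cancer Risk Prediction

1. Background

An effective cancer prediction system helps people learn their cancer risk at low cost and make appropriate decisions based on that risk. The data was collected from an online lung cancer prediction website.

2. Data Description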

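Total fields: 16
Instances: 284 (as stated in the data set description; the CSV file read below contains 309 rows)
Field information:
1. Gender: M (male), F (female)
2. Age: age of the patient
3. Smoking: YES=2, NO=1
4. Yellow fingers: YES=2, NO=1
5. Anxiety: YES=2, NO=1
6. Peer pressure: YES=2, NO=1
7. Chronic disease: YES=2, NO=1
8. Fatigue: YES=2, NO=1
9. Allergy: YES=2, NO=1
10. Wheezing: YES=2, NO=1
11. Alcohol consumption: YES=2, NO=1
12. Coughing: YES=2, NO=1
13. Shortness of breath: YES=2, NO=1
14. Swallowing difficulty: YES=2, NO=1
15. Chest pain: YES=2, NO=1
16. Lung cancer: YES, NO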

3. Data Source

https://www.kaggle.com/datasets/nancyalaswad90/lung-cancer

II. Data Processing

1. Reading the Data

import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()

	GENDER	AGE	SMOKING	YELLOW_FINGERS	ANXIETY	PEER_PRESSURE	CHRONIC DISEASE	FATIGUE	ALLERGY	WHEEZING	ALCOHOL CONSUMING	COUGHING	SHORTNESS OF BREATH	SWALLOWING DIFFICULTY	CHEST PAIN	LUNG_CANCER
0	M	69	1	2	2	1	1	2	1	2	2	2	2	2	2	YES
1	M	74	2	1	1	1	2	2	2	1	1	1	2	2	2	YES
2	F	59	1	1	1	2	1	2	1	2	1	2	2	1	2	NO
3	M	63	2	2	2	1	1	1	1	1	2	1	1	2	2	NO
4	F	63	1	2	1	1	1	1	1	2	1	2	2	1	1	NO

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB

df.isnull().sum()

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

The output shows that there are no null values.

2. Encoding Categorical Data

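# Encode the two text columns as integers; assigning the result back avoids relying on in-place replacement
df["GENDER"]=df["GENDER"].map({"M":1,"F":0})
df["LUNG_CANCER"]=df["LUNG_CANCER"].map({"YES":1,"NO":0})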

import matplotlib.pyplot as plt
%matplotlib inline

3. Viewing the Data Distribution

figure,axes=plt.subplots(nrows=4,ncols=4,figsize=(20,16))

# Draw one bar chart of value counts per column on a 4x4 grid
for i,column in enumerate(df.columns):
    row,col=divmod(i,4)
    df[column].value_counts().plot(ax=axes[row][col], kind='bar',title=f"{column} value counts")

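[Figure: bar charts of the value counts of each column]

The figure above shows that lung cancer cases dominate the target column, while the other columns are fairly balanced.

4. Relationship Between Smoking and Lung Cancer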

smoke_yes=df.loc[df.SMOKING==2,["SMOKING","LUNG_CANCER"]]
smoke_no=df.loc[df.SMOKING==1,["SMOKING","LUNG_CANCER"]]

# reindex([1,0]) fixes the slice order so that it always matches labels=["YES","NO"]
yes_share=smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1,0],fill_value=0)
no_share=smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1,0],fill_value=0)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2,figsize=(16,8))
ax1.pie(yes_share,labels=["YES","NO"],colors=["yellow","green"],autopct='%1.1f%%',shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")

ax2.pie(no_share,labels=["YES","NO"],colors=["red","green"],autopct='%1.1f%%',shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")

Text(0.5,1,'Lung Cancer & Smoking_NO')

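[Figure: pie charts of the lung cancer share among smokers and non-smokers]

5. Relationship Between Allergy, Coughing, Alcohol Consumption, Swallowing Difficulty, Wheezing, Chest Pain, and Lung Cancer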

import seaborn as sns

# Keyword arguments avoid the ambiguity of a positional Series, which newer seaborn versions interpret as data rather than x
fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(data=df,x="LUNG_CANCER",hue="ALLERGY ",ax=ax1,palette=['green', 'black'])
sns.countplot(data=df,x="LUNG_CANCER",hue="COUGHING",ax=ax2,palette=['green', 'black'])
sns.countplot(data=df,x="LUNG_CANCER",hue="ALCOHOL CONSUMING",ax=ax3,palette=['green', 'black'])

fig,(ax1,ax2,ax3)=plt.subplots(1,3,figsize=(30,8))
sns.countplot(data=df,x="LUNG_CANCER",hue="SWALLOWING DIFFICULTY",ax=ax1,palette=['green', 'black'])
sns.countplot(data=df,x="LUNG_CANCER",hue="WHEEZING",ax=ax2,palette=['green', 'black'])
sns.countplot(data=df,x="LUNG_CANCER",hue="CHEST PAIN",ax=ax3,palette=['green', 'black'])

<matplotlib.axes._subplots.AxesSubplot at 0x7fba81b66350>

[Figure: count plots of lung cancer status grouped by each symptom]

6. Plotting a Heatmap

import seaborn as sns
plt.figure(figsize=(16,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=0, vmax=1)

<matplotlib.axes._subplots.AxesSubplot at 0x7fba83b48d90>

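[Figure: correlation heatmap of all columns]

The heatmap shows that gender, age, and smoking status have only a weak correlation with lung cancer in this data set.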
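The numbers in the heatmap are pairwise correlation coefficients; `df.corr()` uses the Pearson coefficient by default. The following minimal sketch shows how one such coefficient is computed. The two short arrays are hypothetical toy data that only mimic the coding of the real columns, so the resulting value says nothing about the actual data set.

```python
import numpy as np

# Hypothetical toy columns using the same 1/2 and 0/1 coding as the data set
smoking = np.array([1, 2, 2, 1, 2, 1, 2, 2])
lung_cancer = np.array([0, 1, 1, 1, 1, 0, 0, 1])

# Pearson r = sum of products of deviations / sqrt(product of summed squared deviations)
dx = smoking - smoking.mean()
dy = lung_cancer - lung_cancer.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; [0, 1] is the off-diagonal entry
r_numpy = np.corrcoef(smoking, lung_cancer)[0, 1]

print(round(r_manual, 3), round(r_numpy, 3))
```

A coefficient close to 0 only indicates the absence of a linear relationship, so a weak value in the heatmap does not by itself rule out that a feature is useful to a tree-based model.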

7. Constructing X and y

# Build the feature matrix X and the target vector y
X=df.drop(columns=["LUNG_CANCER"])
y=df["LUNG_CANCER"]

y.value_counts()

1    270
0     39
Name: LUNG_CANCER, dtype: int64

sns.countplot(x=y)

<matplotlib.axes._subplots.AxesSubplot at 0x7fba81a56590>

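[Figure: count plot of the target classes before balancing]

8. Balancing the Data

After installation, the kernel must be restarted for the packages to take effect; otherwise an error is raised. The installation commands are as follows: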

from IPython.display import clear_output
# The package is published on PyPI as imbalanced-learn and imported as imblearn
!pip install imbalanced-learn --user
!pip uninstall scipy -y
!pip install scipy --user

clear_output()

from imblearn.over_sampling import SMOTE

help(SMOTE)

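sampling_strategy accepts the following string values:

'minority': resample only the minority class
'not minority': resample all classes except the minority class
'not majority': resample all classes except the majority class
'all': resample all classes
'auto': equivalent to 'not majority'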

from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
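
The idea behind SMOTE is to create new minority-class rows by interpolating between an existing minority row and one of its nearest minority neighbours. The following is a minimal sketch of that idea only, not the imblearn implementation; the function name `smote_like_samples`, its parameters, and the toy points are hypothetical, and the sketch assumes there are no duplicate rows.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # Indices of the k nearest neighbours of each sample (column 0 is the sample itself)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))           # pick a minority sample at random
        nb = X_min[rng.choice(neighbours[i])]  # pick one of its nearest neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic

# Hypothetical minority-class points on the corners of a unit square
X_minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_points = smote_like_samples(X_minority, n_new=5)
print(new_points.shape)
```

Because every synthetic point lies on a line segment between two existing points, features coded as 1 or 2 can take fractional values after oversampling.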

sns.countplot(x=y)

<matplotlib.axes._subplots.AxesSubplot at 0x7fbd565994d0>

[Figure: count plot of the target classes after balancing]

III. Model Training and Evaluation

1. Splitting the Data Set

from sklearn.model_selection import train_test_split,cross_val_score

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=2023)
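
Because X and y were oversampled in step 8 before this split, synthetic rows interpolated from a row that lands in the test set can end up in the training set, which can make test scores look better than they would be on unseen data. The following minimal sketch shows the alternative ordering: split first, then balance only the training portion. The toy data and the hand-picked split are hypothetical, and random duplication of minority rows stands in for SMOTE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 positive rows followed by 4 negative rows
X = rng.normal(size=(24, 3))
y = np.array([1] * 20 + [0] * 4)

# 1) Hold out a test set first (a fixed, hand-stratified split keeps the sketch deterministic)
test_idx = np.r_[0:5, 20:21]  # 5 positives and 1 negative
train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Balance only the training portion; random duplication stands in for SMOTE here
minority_rows = np.flatnonzero(y_train == 0)
n_extra = (y_train == 1).sum() - len(minority_rows)
extra = rng.choice(minority_rows, size=n_extra, replace=True)
X_train_bal = np.vstack([X_train, X_train[extra]])
y_train_bal = np.concatenate([y_train, y_train[extra]])

print(np.bincount(y_train_bal))  # class counts in the balanced training set
print(np.bincount(y_test))       # the test set keeps its original class ratio
```

With the libraries used in this article, the same ordering would amount to calling train_test_split on the original data and then calling smote.fit_resample on X_train and y_train only.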

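2. Standardizing the Data

The return value is the standardized data.
The code below imports the StandardScaler class and creates a StandardScaler object named scaler. With the fit method, StandardScaler estimates the parameters μ (sample mean) and σ (standard deviation) of each feature dimension from the training data. With the transform method, it standardizes the training and test data using the estimated μ and σ. fit_transform performs both steps on the training data in a single call.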

from sklearn.preprocessing import StandardScaler

help(StandardScaler)

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

print(X_train[0])

[-0.7710306   1.41036889  1.08508956  1.25031642  1.39864376  1.39096463
 -0.72288062  0.93078432 -0.70710678  1.36833491 -0.73479518  1.39096463
  0.88551735  1.53202723 -0.72288062]
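
What fit and transform compute can be reproduced by hand. The following minimal sketch uses hypothetical toy values: it estimates μ and σ on the training rows only and then applies them to both splits, which is the same order of operations as in the code above.

```python
import numpy as np

# Hypothetical toy data: two features, three training rows and one test row
X_train = np.array([[1.0, 60.0], [2.0, 70.0], [2.0, 80.0]])
X_test = np.array([[1.0, 75.0]])

# "fit": estimate the per-feature mean and standard deviation on the training data only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": apply the same mu and sigma to both splits
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0))  # approximately 0 for each feature
print(X_train_std.std(axis=0))   # approximately 1 for each feature
print(X_test_std)
```

The test row is scaled with the training statistics, so its standardized values are not centred on 0 in general.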

3. Training the Random Forest

from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_prdrf=rf.predict(X_test)

4. Model Evaluation

from sklearn.metrics import classification_report,confusion_matrix

print(classification_report(y_test,y_prdrf))
cvs_rf=round(cross_val_score(rf,X,y,scoring="accuracy",cv=10).mean(),2)
print("Cross validation score for Random Forest Classifier model is:",cvs_rf)

              precision    recall  f1-score   support

           0       0.95      0.99      0.97        79
           1       0.98      0.93      0.95        56

    accuracy                           0.96       135
   macro avg       0.97      0.96      0.96       135
weighted avg       0.96      0.96      0.96       135

Cross validation score for Random Forest Classifier model is: 0.96
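
The precision, recall, and f1-score columns in the report can be derived from four confusion matrix counts. The following minimal sketch uses counts inferred from the rounded recall and support values in the report; they are an assumption, not values read from the original confusion matrix, and the helper name `prf` is hypothetical.

```python
# Counts inferred so that they reproduce the rounded report (an assumption,
# not values read from the original confusion matrix)
tn, fp = 78, 1  # true class 0: predicted 0 / predicted 1
fn, tp = 4, 52  # true class 1: predicted 0 / predicted 1

def prf(true_pos, false_pos, false_neg):
    """Return precision, recall, and F1 for one class treated as positive."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 treats label 1 as positive; class 0 swaps the roles of the counts
p1, r1, f1_1 = prf(tp, fp, fn)
p0, r0, f1_0 = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the metrics this way makes the trade-off visible: under these counts, most of the errors are class 1 rows predicted as class 0, which lowers the recall of class 1.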

5. Plotting the Confusion Matrix

sns.heatmap(confusion_matrix(y_test,y_prdrf),annot=True,cmap='viridis')
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.title("Confusion matrix- Random Forest Classifier")

Text(0.5,1,'Confusion matrix- Random Forest Classifier')

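[Figure: confusion matrix heatmap for the random forest classifier]

The confusion matrix shows that the model is quite accurate on this test split.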

This article is a repost of the original project.
