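Machine-Learning-Based Lung Cancer Risk Analysis and Prediction
This project balances the classes with imblearn, standardizes the features with StandardScaler, and trains a random forest classifier for prediction.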
This article is adapted from a featured project in the AI Studio community.
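I. Lung Cancer Risk Prediction
1. Background
An effective cancer prediction system lets people learn their cancer risk at low cost and make appropriate decisions based on it. The data was collected from an online lung cancer prediction website.
2. Data description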
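Number of fields: 16
Number of instances: 284 (per the dataset description; the CSV loaded below contains 309 rows)
Field details:
1. Gender: M (male), F (female)
2. Age: age of the patient
3. Smoking: YES=2, NO=1
4. Yellow fingers: YES=2, NO=1
5. Anxiety: YES=2, NO=1
6. Peer pressure: YES=2, NO=1
7. Chronic disease: YES=2, NO=1
8. Fatigue: YES=2, NO=1
9. Allergy: YES=2, NO=1
10. Wheezing: YES=2, NO=1
11. Alcohol consumption: YES=2, NO=1
12. Coughing: YES=2, NO=1
13. Shortness of breath: YES=2, NO=1
14. Swallowing difficulty: YES=2, NO=1
15. Chest pain: YES=2, NO=1
16. Lung cancer: YES, NO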
3. Data source
https://www.kaggle.com/datasets/nancyalaswad90/lung-cancer
II. Data Processing
1. Load the data
import pandas as pd
df=pd.read_csv("data/data209803/survey_lung_cancer.csv", index_col=None)
df.head()
| | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES |
| 1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES |
| 2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO |
| 3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO |
| 4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 GENDER 309 non-null object
1 AGE 309 non-null int64
2 SMOKING 309 non-null int64
3 YELLOW_FINGERS 309 non-null int64
4 ANXIETY 309 non-null int64
5 PEER_PRESSURE 309 non-null int64
6 CHRONIC DISEASE 309 non-null int64
7 FATIGUE 309 non-null int64
8 ALLERGY 309 non-null int64
9 WHEEZING 309 non-null int64
10 ALCOHOL CONSUMING 309 non-null int64
11 COUGHING 309 non-null int64
12 SHORTNESS OF BREATH 309 non-null int64
13 SWALLOWING DIFFICULTY 309 non-null int64
14 CHEST PAIN 309 non-null int64
15 LUNG_CANCER 309 non-null object
dtypes: int64(14), object(2)
memory usage: 38.8+ KB
df.isnull().sum()
GENDER 0
AGE 0
SMOKING 0
YELLOW_FINGERS 0
ANXIETY 0
PEER_PRESSURE 0
CHRONIC DISEASE 0
FATIGUE 0
ALLERGY 0
WHEEZING 0
ALCOHOL CONSUMING 0
COUGHING 0
SHORTNESS OF BREATH 0
SWALLOWING DIFFICULTY 0
CHEST PAIN 0
LUNG_CANCER 0
dtype: int64
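The dataset contains no missing values.
2. Encode the text columns as numbers
# Assign the result back instead of using a chained inplace replace,
# which recent pandas versions warn about and may silently ignore
df["GENDER"] = df["GENDER"].map({"M": 1, "F": 0})
df["LUNG_CANCER"] = df["LUNG_CANCER"].map({"YES": 1, "NO": 0})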
import matplotlib.pyplot as plt
%matplotlib inline
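The encoding step can also be pictured as plain dictionary lookups. The sketch below uses two made-up records (they are not rows from the dataset) and a hypothetical helper named `encode`. It also shows an optional step, which this article does not perform, that shifts a 1/2 symptom code to 0/1.

```python
# Hypothetical records for illustration only; they are not rows from the dataset.
records = [
    {"GENDER": "M", "SMOKING": 2, "LUNG_CANCER": "YES"},
    {"GENDER": "F", "SMOKING": 1, "LUNG_CANCER": "NO"},
]

GENDER_CODES = {"M": 1, "F": 0}
TARGET_CODES = {"YES": 1, "NO": 0}


def encode(record):
    """Return a copy of the record with text labels replaced by integers."""
    encoded = dict(record)
    encoded["GENDER"] = GENDER_CODES[record["GENDER"]]
    encoded["LUNG_CANCER"] = TARGET_CODES[record["LUNG_CANCER"]]
    # Optional (an assumption, not done in the article): map NO=1/YES=2 to 0/1.
    encoded["SMOKING"] = record["SMOKING"] - 1
    return encoded


encoded_records = [encode(r) for r in records]
print(encoded_records)
```

A dictionary lookup raises `KeyError` on an unexpected label, which can be a useful early warning when the raw file contains values other than the documented ones.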
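3. Inspect the data distribution
figure, axes = plt.subplots(nrows=4, ncols=4, figsize=(20, 16))
# Draw one bar chart of value counts per column on a 4x4 grid
for i, column in enumerate(df.columns):
    row, col = divmod(i, 4)
    df[column].value_counts().plot(ax=axes[row][col], kind='bar', title=f"{column} value counts")
The charts show that the target is heavily skewed toward lung cancer cases, while most of the other columns are roughly balanced.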
4. Smoking and lung cancer
smoke_yes = df.loc[df.SMOKING == 2, ["SMOKING", "LUNG_CANCER"]]
smoke_no = df.loc[df.SMOKING == 1, ["SMOKING", "LUNG_CANCER"]]
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
# Reindex to [1, 0] so the slices always line up with the "YES", "NO" labels
ax1.pie(smoke_yes.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0), labels=["YES", "NO"], colors=["yellow", "green"], autopct='%1.1f%%', shadow=True)
ax1.set_title("Lung Cancer & Smoking_YES")
ax2.pie(smoke_no.LUNG_CANCER.value_counts(normalize=True).reindex([1, 0], fill_value=0), labels=["YES", "NO"], colors=["red", "green"], autopct='%1.1f%%', shadow=True)
ax2.set_title("Lung Cancer & Smoking_NO")
Text(0.5,1,'Lung Cancer & Smoking_NO')
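Each pie chart shows the share of lung cancer cases within one smoking group. As a minimal sketch of that calculation, the snippet below computes the same kind of per-group rate from a handful of made-up (SMOKING, LUNG_CANCER) pairs. The data and the helper name `cancer_rate_by_group` are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict

# Hypothetical (SMOKING, LUNG_CANCER) pairs; SMOKING uses 2=yes, 1=no.
pairs = [(2, 1), (2, 1), (2, 0), (2, 1), (1, 1), (1, 0), (1, 0), (1, 1)]


def cancer_rate_by_group(pairs):
    """Return {group: fraction of positive labels} for each group value."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in pairs:
        counts[group][0] += label
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


rates = cancer_rate_by_group(pairs)
print(rates)
```

Comparing the two rates side by side is usually easier to read than comparing two separate pies, especially when the groups have different sizes.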
5. Allergy, coughing, alcohol consumption, swallowing difficulty, wheezing, chest pain and lung cancer
import seaborn as sns
# Pass the series as x=...; recent seaborn versions no longer accept it positionally
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
sns.countplot(x=df.LUNG_CANCER, hue=df["ALLERGY "], ax=ax1, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df.COUGHING, ax=ax2, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df["ALCOHOL CONSUMING"], ax=ax3, palette=['green', 'black'])
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(30, 8))
sns.countplot(x=df.LUNG_CANCER, hue=df["SWALLOWING DIFFICULTY"], ax=ax1, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df.WHEEZING, ax=ax2, palette=['green', 'black'])
sns.countplot(x=df.LUNG_CANCER, hue=df["CHEST PAIN"], ax=ax3, palette=['green', 'black'])
<matplotlib.axes._subplots.AxesSubplot at 0x7fba81b66350>
6. Plot the correlation heatmap
import seaborn as sns
plt.figure(figsize=(16,10))
sns.heatmap(df.corr(),annot=True,cmap='viridis',vmin=0, vmax=1)
<matplotlib.axes._subplots.AxesSubplot at 0x7fba83b48d90>
The heatmap shows that gender, age, and smoking status are only weakly correlated with lung cancer in this dataset.
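By default `df.corr()` computes the Pearson correlation coefficient for each pair of columns. The sketch below implements that formula in plain Python on made-up columns, so the variable names and values are illustrative assumptions rather than data from the article. It also shows that recoding a 1/2 column to 0/1 leaves the coefficient unchanged.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: a 1/2-coded symptom and a 0/1 target.
target = [0, 1, 1, 0, 1, 0]
symptom = [1, 2, 2, 1, 2, 1]       # moves exactly with the target
alternating = [1, 2, 1, 2, 1, 2]   # mostly unrelated to the target

print(pearson(symptom, target))
print(pearson(alternating, target))
# Shifting the 1/2 coding to 0/1 does not change the coefficient.
print(pearson([v - 1 for v in alternating], target))
```

Pearson correlation only captures linear association, so a weak coefficient does not by itself rule out that a feature is useful to a tree-based model.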
7. Build X and y
# Build the feature matrix X and the target vector y
X = df.drop(columns=["LUNG_CANCER"])
y = df["LUNG_CANCER"]
y.value_counts()
1 270
0 39
Name: LUNG_CANCER, dtype: int64
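The class counts above are enough to quantify the imbalance. The following sketch rebuilds a label list from those two counts (270 positives, 39 negatives) and derives the imbalance ratio and the accuracy of a trivial model that always predicts the majority class. The variable names are chosen here for illustration.

```python
from collections import Counter

# Label list rebuilt from the counts printed by y.value_counts() above.
labels = [1] * 270 + [0] * 39

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
minority_count = min(counts.values())

# Accuracy of a model that ignores the features and always predicts the majority class.
baseline_accuracy = majority_count / len(labels)
imbalance_ratio = majority_count / minority_count

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Since always answering "lung cancer" would already be right about 87% of the time on the raw data, plain accuracy says little here, which motivates balancing the classes before training.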
sns.countplot(x=y)
<matplotlib.axes._subplots.AxesSubplot at 0x7fba81a56590>
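8. Balance the classes
After installing the package, restart the kernel so that it takes effect; otherwise the import fails. The installation commands are: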
from IPython.display import clear_output
!pip install imbalanced-learn --user
!pip uninstall scipy -y
!pip install scipy --user
clear_output()
from imblearn.over_sampling import SMOTE
help(SMOTE)
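sampling_strategy accepts the following string values:
- 'minority': resample only the minority class
- 'not minority': resample all classes except the minority class
- 'not majority': resample all classes except the majority class
- 'all': resample all classes
- 'auto': equivalent to 'not majority'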
from imblearn.over_sampling import SMOTE
smote=SMOTE(sampling_strategy='minority')
X,y=smote.fit_resample(X,y)
sns.countplot(x=y)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd565994d0>
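SMOTE creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The snippet below is a simplified sketch of that idea in plain Python, not the imblearn implementation. The function name `smote_like_sample`, its parameters, and the sample points are all assumptions made for illustration.

```python
import random


def smote_like_sample(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of base among the other minority points,
        # ranked by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical 2-D minority points, for illustration only.
minority_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like_sample(minority_points)
print(new_points)
```

Because the new points lie on line segments between real ones, a column coded 1/2 can receive fractional values after oversampling. The imbalanced-learn package also provides SMOTEN and SMOTENC for categorical features; this article does not use them.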
III. Model Training and Evaluation
1. Split the dataset
from sklearn.model_selection import train_test_split,cross_val_score
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=2023)
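2. Standardize the data
- The return value is the standardized data.
- The code imports the StandardScaler class and creates a StandardScaler object named scaler. Calling fit estimates the parameters μ (sample mean) and σ (standard deviation) of each feature from the training data. Calling transform then uses the estimated μ and σ to standardize both the training data and the test data.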
from sklearn.preprocessing import StandardScaler
help(StandardScaler)
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)
print(X_train[0])
[-0.7710306 1.41036889 1.08508956 1.25031642 1.39864376 1.39096463
-0.72288062 0.93078432 -0.70710678 1.36833491 -0.73479518 1.39096463
0.88551735 1.53202723 -0.72288062]
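The standardization applied above is z = (x - μ) / σ, with μ and σ estimated from the training data only. As a minimal sketch of the same fit-then-transform pattern on a single column, the snippet below uses the five AGE values shown by df.head() as a stand-in training column and two made-up test values. The helper names `fit_scaler` and `transform` are hypothetical.

```python
import statistics


def fit_scaler(column):
    """Estimate the mean and population standard deviation of one column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return mean, std


def transform(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(value - mean) / std for value in column]


train_age = [69, 74, 59, 63, 63]  # AGE of the five rows shown by df.head()
test_age = [70, 55]               # hypothetical test values

mean, std = fit_scaler(train_age)                # fit on the training data only
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)     # reuse the training mean and std

print(mean, std)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test data keeps information about the test set out of the preprocessing step, which is why the article calls fit_transform on X_train but only transform on X_test.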
3. Train the random forest
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
y_prdrf=rf.predict(X_test)
4. Evaluate the model
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,y_prdrf))
cvs_rf=round(cross_val_score(rf,X,y,scoring="accuracy",cv=10).mean(),2)
print("Cross validation score for Random Forest Classifier model is:",cvs_rf)
precision recall f1-score support
0 0.95 0.99 0.97 79
1 0.98 0.93 0.95 56
accuracy 0.96 135
macro avg 0.97 0.96 0.96 135
weighted avg 0.96 0.96 0.96 135
Cross validation score for Random Forest Classifier model is: 0.96
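The figures in the report follow directly from the confusion matrix counts. The article does not print the raw matrix, so the counts below are an assumption: they were inferred as the integers that reproduce the rounded report above with supports of 79 and 56. The helper name `per_class_metrics` is also chosen here for illustration.

```python
# Assumed counts, inferred from the rounded report (not printed in the article).
# Rows are the true class, columns are the predicted class.
cm = [[78, 1],
      [4, 52]]


def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fn = sum(cm[cls]) - tp                  # true cls predicted as something else
    fp = sum(row[cls] for row in cm) - tp   # other classes predicted as cls
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for cls in (0, 1):
    precision, recall, f1 = per_class_metrics(cm, cls)
    print(cls, round(precision, 2), round(recall, 2), round(f1, 2))
print("accuracy", round(accuracy, 2))
```

Reading the metrics this way makes the error types explicit: under the assumed counts, the model misses four true lung cancer cases and raises one false alarm.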
5. Plot the confusion matrix
sns.heatmap(confusion_matrix(y_test,y_prdrf),annot=True,cmap='viridis')
plt.xlabel("Predicted")
plt.ylabel("Truth")
plt.title("Confusion matrix- Random Forest Classifier")
Text(0.5,1,'Confusion matrix- Random Forest Classifier')
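The confusion matrix confirms that the model is quite accurate on this test split. Because the split was made after SMOTE oversampling, the test data also contains synthetic samples.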