Merge branch 'main' of https://gitty.informatik.hs-mannheim.de/1826514/DSA_SS24
commit 4c90f4cb40

README.md | 38

@@ -243,14 +243,44 @@ For machine learning, the initial step involved tailoring the features for the m

## Cluster Analysis

(version 03.07)

[](notebooks/...)
To enhance our understanding of the feature clusters and their similarity to the original data labels, we initiated our analysis by preparing the data. This preparation involved normalization and imputation of missing values with mean substitution. Subsequently, we employed the K-Means algorithm for distance-based clustering of the data. Our exploration focused on comparing two types of labels: those derived from the K-Means algorithm and the original dataset labels. Various visualizations were generated to facilitate this comparison.
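A minimal sketch of this step with scikit-learn (the input `df_features`, its path, and the choice of `n_clusters=4`, one per diagnosis label, are illustrative assumptions, not the exact notebook code):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical input: one row of extracted ECG features per record.
df_features = pd.read_pickle("data/features.pkl")

# Mean-impute missing values, then normalize each feature.
X = SimpleImputer(strategy="mean").fit_transform(df_features)
X = StandardScaler().fit_transform(X)

# Distance-based clustering with one cluster per original label (GSVT, AFIB, SR, SB).
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
```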
A dimensionality reduction plot using Principal Component Analysis (PCA) revealed that, although the clusters formed by the K-Means algorithm and the original labels are not highly similar, both exhibit some degree of clustering. To quantitatively assess the quality of the K-Means clusters, we calculated the following metrics (a computation sketch follows the list):
- Adjusted Rand Index (ARI): 0.15
- Normalized Mutual Information (NMI): 0.24
- Silhouette Score: 0.47
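These values and the PCA comparison plot can be produced along the following lines (a sketch assuming `X` and `cluster_labels` from the snippet above, plus the original labels in `true_labels`):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

# Agreement between the K-Means clusters and the original labels.
print("ARI:", adjusted_rand_score(true_labels, cluster_labels))
print("NMI:", normalized_mutual_info_score(true_labels, cluster_labels))
# Internal cohesion and separation of the clusters, independent of the labels.
print("Silhouette:", silhouette_score(X, cluster_labels))

# Project onto the first two principal components and compare both labelings.
X_2d = PCA(n_components=2).fit_transform(X)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, labels, title in zip(axes, (cluster_labels, true_labels), ("K-Means clusters", "Original labels")):
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=np.unique(labels, return_inverse=True)[1], s=5)
    ax.set_title(title)
plt.show()
```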
The ARI and NMI scores indicate that the clustering reflects the structure of the true labels only moderately well: there is some alignment, but the clusters do not capture the underlying groupings precisely. This suggests that the feature-space distances used by the clustering algorithm do not fully mirror the categorizations given by the original labels of the data.
The Silhouette Score suggests that the clusters identified are internally cohesive and distinct from each other, indicating that the clustering algorithm has been somewhat successful in identifying meaningful structures within the data, even if these structures do not align perfectly with the true labels.
Further analysis included the creation of a Euclidean distance matrix plot to visualize patterns of data point separation. This analysis revealed the presence of outliers, as some data points were significantly more distant from others.
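A sketch of this plot (again assuming the preprocessed feature matrix `X` from above):

```python
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

# Pairwise Euclidean distances, expanded from condensed to square form.
dist_matrix = squareform(pdist(X, metric="euclidean"))
plt.imshow(dist_matrix, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix")
plt.show()
```

Rows and columns that are uniformly bright correspond to the outliers mentioned above.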
Finally, a parallel axis plot was generated to examine the relationship between the data features and the clusters. Notably, this plot highlighted the ventricular rate feature as a significant separator in the original labels, consistent with the importance our machine learning models assigned to it when predicting the labels.
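Such a plot can be produced with pandas (a sketch; `df_features` and `true_labels` are the assumed inputs from above, and subsampling keeps the figure readable):

```python
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# One line per record, one vertical axis per feature, colored by original label.
df_plot = df_features.copy()
df_plot["label"] = true_labels
parallel_coordinates(df_plot.sample(n=500, random_state=42), class_column="label", alpha=0.3)
plt.xticks(rotation=90)
plt.show()
```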
![Cluster analysis](readme_data/Cluster_analysis.png)
<br>The detailed procedures can be found in the following notebook:
<br>[cluster_features.ipynb](notebooks/cluster_features.ipynb)
## Legal Basis and Data Biases
(version 03.07)

- All data come from a single collaborative hospital dataset
- Most of the data come from older people, predominantly from the 60-70 age group

### Local Bias

- The dataset originates from a single collaborative source, with contributions from Chapman University, Shaoxing People's Hospital (affiliated with Shaoxing Hospital, Zhejiang University School of Medicine), and Ningbo First Hospital. This may introduce a local bias, as all data were collected within a specific geographic and institutional context.
### Demographic Bias
- The dataset predominantly features data from older individuals, with the majority of participants falling within the 60-70 age group. This demographic skew is further detailed by:
- Average age: 59.59 years
- Standard deviation of age: 18.29 years
- Male ratio: 57.34%
- Female ratio: 42.66%
This indicates a potential demographic bias towards older age groups and a gender imbalance.
# TODO
- Consent and anonymity:
- Data protection and ethics:
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -11,7 +11,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [

@@ -59,6 +59,69 @@

"df_dgc['age_group'] = pd.cut(df_dgc['age'], bins=age_categories)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Average age: 59.58733889924617\n",
"Std Dev age: 18.29087120360519\n",
"Average age group: age_group\n",
"(0, 10] 6.715503\n",
"(10, 20] 16.360606\n",
"(20, 30] 26.066710\n",
"(30, 40] 35.847409\n",
"(40, 50] 46.229902\n",
"(50, 60] 55.403579\n",
"(60, 70] 65.557701\n",
"(70, 80] 75.208785\n",
"(80, 90] 84.706091\n",
"Name: age, dtype: float64\n",
"Std Dev age group: age_group\n",
"(0, 10] 1.883777\n",
"(10, 20] 2.817185\n",
"(20, 30] 2.968634\n",
"(30, 40] 2.878519\n",
"(40, 50] 2.749121\n",
"(50, 60] 2.936383\n",
"(60, 70] 2.884971\n",
"(70, 80] 2.945118\n",
"(80, 90] 2.749137\n",
"Name: age, dtype: float64\n",
"Male Ratio: 0.5733970981600065\n",
"Female Ratio: 0.42657588284564046\n"
]
}
],
"source": [
"# avg age and std dev overall and for each group\n",
"avg_age = df_dgc['age'].mean()\n",
"std_age = df_dgc['age'].std()\n",
"avg_age_group = df_dgc.groupby('age_group')['age'].mean()\n",
"std_age_group = df_dgc.groupby('age_group')['age'].std()\n",
"\n",
"# print \n",
"print(\"Average age: \", avg_age)\n",
"print(\"Std Dev age: \", std_age)\n",
"print(\"Average age group: \", avg_age_group)\n",
"print(\"Std Dev age group: \", std_age_group)\n",
"\n",
"# female and male ratio\n",
"count_male = df_dgc[df_dgc['gender'] == 'Male'].shape[0]\n",
"count_female = df_dgc[df_dgc['gender'] == 'Female'].shape[0]\n",
"count_total = df_dgc.shape[0]\n",
"male_ratio = count_male / count_total\n",
"female_ratio = count_female / count_total\n",
"\n",
"# print\n",
"print('Male Ratio: ', male_ratio)\n",
"print('Female Ratio:', female_ratio)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
File diff suppressed because one or more lines are too long
Binary file not shown.
After Width: | Height: | Size: 732 KiB |
Binary file not shown.
@@ -13,7 +13,7 @@ import cv2 as cv
TODO create overall description
"""

def load_data(only_demographic:bool=False, path_settings:str="../settings.json"):
def load_data(only_demographic:bool=False, only_diagnosis_ids=False, path_settings:str="../settings.json"):
    """
    Loads data from pickle files based on the specified settings.
@@ -28,6 +28,10 @@ def load_data(only_demographic:bool=False, path_settings:str="../settings.json")
    path_data = settings["data_path"]
    labels = settings["labels"]

    if only_diagnosis_ids:
        with open(f'{path_data}/diagnosis.pkl', 'rb') as f:
            return pickle.load(f)

    data = {}
    if only_demographic:
        data = {'age': [], 'diag': [], 'gender': []}
@@ -5,6 +5,7 @@ import math
import time
from multiprocessing import Pool
import sqlite3
import random

def get_y_value(ecg_cleaned, indecies):
    """
@@ -213,7 +214,6 @@ def extract_features_parallel(data_dict, num_processes, sampling_rate=500, used_
    c = conn.cursor()
    # get unique data
    data_dict = exclude_already_extracted(data_dict, conn)

    for label, data in data_dict.items():
        print(f"Extracting features for {label} with {len(data)} data entries.")
        with Pool(processes=num_processes) as pool:
@@ -239,7 +239,7 @@ def extract_features_parallel(data_dict, num_processes, sampling_rate=500, used_


def extract_features(data_dict, sampling_rate=500, used_channels=[0, 1, 2, 3, 4, 5]):
def extract_features(data_dict, sampling_rate=500, used_channels=[0, 1, 2, 3, 4, 5], limit=1000):
    """
    Extracts the features from the data.
    Args:
@@ -266,6 +266,8 @@ def extract_features(data_dict, sampling_rate=500, used_channels=[0, 1, 2, 3, 4,
print("No last file in DB")
|
||||
|
||||
for label, data in data_dict.items():
|
||||
# get limit amount of radom samples out of data
|
||||
data = random.sample(data, min(len(data), limit))
|
||||
print(f"Extracting features for {label} with {len(data)} data entries.")
|
||||
for data_idx, record in enumerate(data):
|
||||
# Skip the records that are already in the database
|
||||
|
|
|
@@ -30,7 +30,7 @@ def get_diagnosis_ids(record):
    list_diagnosis = [int(x.strip()) for x in diagnosis.split(',')]
    return list_diagnosis

def generate_raw_data(path_to_data, settings, max_counter=100_000):
def generate_raw_data(path_to_data, settings, max_counter=100_000, only_ids=False):
    """
    Generates the raw data from the WFDB records.
    Args:
@@ -43,6 +43,9 @@ def generate_raw_data(path_to_data, settings, max_counter=100_000):
    failed_records = []
    categories = settings["labels"]

    if only_ids:
        diag_dict = {}
    else:
        diag_dict = {k: [] for k in categories.keys()}
    # Loop through the records
    for dir_th in os.listdir(path_to_data):
@@ -60,6 +63,9 @@ def generate_raw_data(path_to_data, settings, max_counter=100_000):
            record = wfdb.rdrecord(path_to_100_records + '/' + record_name)
            # Get the diagnosis
            diagnosis = np.array(get_diagnosis_ids(record))
            if only_ids:
                diag_dict[record_name] = diagnosis
            else:
                # check if diagnosis is a subset of one of the categories
                for category_name, category_codes in categories.items():
                    # if any of the diagnosis codes is in the category_codes
@@ -83,7 +89,7 @@ def generate_raw_data(path_to_data, settings, max_counter=100_000):
                    break
    return diag_dict

def write_data(data_dict, path='./data', file_prefix=''):
def write_data(data_dict, path='./data', file_prefix='', only_ids=False):
    """
    Writes the data to a pickle file.
    Args:
@@ -93,6 +99,13 @@ def write_data(data_dict, path='./data', file_prefix=''):
    # create the path if it does not exist
    if not os.path.exists(path):
        os.makedirs(path)

    if only_ids:
        # write to pickle
        print(f"Writing diagnosis IDs to pickle with {len(data_dict)} data entries.")
        with open(f'{path}/{file_prefix}.pkl', 'wb') as f:
            pickle.dump(data_dict, f)
        return
    # write to pickle
    for cat_name, data in data_dict.items():
        print(f"Writing {cat_name} to pickle with {len(data)} data entries.")
@@ -114,7 +127,7 @@ def generate_feature_data(input_data_path, settings, parallel=False, split_ratio
        split_ratio = settings['split_ratio']
    print(list(os.listdir(input_data_path)))
    for file in os.listdir(input_data_path):
        if file.endswith(".pkl"):
        if file.endswith(".pkl") and not file.startswith("diagnosis"):
            print(f"Reading {file}")
            with open(f'{input_data_path}/{file}', 'rb') as f:
                data = pickle.load(f)
@@ -127,13 +140,14 @@ def generate_feature_data(input_data_path, settings, parallel=False, split_ratio
print(f"Using {max_processes} processes to extract features.")
|
||||
feature_extraction.extract_features_parallel(data_dict, num_processes=max_processes)
|
||||
else:
|
||||
feature_extraction.extract_features(data_dict)
|
||||
print(f"For even distribution of data, the limit is set to the smallest size: 1000.")
|
||||
feature_extraction.extract_features(data_dict, limit=1000)
|
||||
# Split the data
|
||||
feature_extraction.split_and_shuffle_data(split_ratio=split_ratio)
|
||||
|
||||
|
||||
|
||||
def main(gen_data=True, gen_features=True, split_ratio=None, parallel=False, settings_path='./settings.json', num_process_files=-1):
def main(gen_data=True, gen_features=True, gen_diag_ids=True, split_ratio=None, parallel=False, settings_path='./settings.json', num_process_files=-1):
    """
    Main function to generate the data.
    Args:
@@ -159,6 +173,11 @@ def main(gen_data=True, gen_features=True, split_ratio=None, parallel=False, set
    if gen_features:
        feature_data_dict = generate_feature_data(settings["data_path"], settings, split_ratio=split_ratio, parallel=parallel)
        ret_data = feature_data_dict
    if gen_diag_ids:
        raw_data_dir = settings["wfdb_path"] + '/WFDBRecords'
        data_dict = generate_raw_data(raw_data_dir, settings, max_counter=num_process_files, only_ids=True)
        write_data(data_dict, path=settings["data_path"], file_prefix='diagnosis', only_ids=True)
        ret_data = data_dict

    return ret_data
@@ -178,6 +197,7 @@ if __name__ == '__main__':
    # SB, AFIB, GSVT, SR
    # new GSVT, AFIB, SR, SB
    # Generate the data
    main(gen_data=True, gen_features=False, num_process_files=100_000)
    #main(gen_data=False, gen_features=True, split_ratio=[0.8, 0.1, 0.1], parallel=False, num_process_files=100_000)
    #main(gen_data=True, gen_features=False, gen_diag_ids=False, num_process_files=100_000)
    #main(gen_data=False, gen_features=True, gen_diag_ids=False, split_ratio=[0.8, 0.1, 0.1])
    main(gen_data=False, gen_features=False, gen_diag_ids=True)
    print("Data generation completed.")
@@ -1,15 +0,0 @@
{
  "wfdb_path_comment": "Path to the WFDB data. This is the folder where the WFDB data is stored.",
  "wfdb_path": "C:/Users/arman/PycharmProjects/pythonProject/DSA/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0",
  "data_path_comment": "Path to the data folder. This is the folder where the generated data is stored.",
  "data_path": "C:/Users/arman/PycharmProjects/pythonProject/DSA/DSA_SS24/data",
  "labels_comment": "Labels for the different classes. The labels are the SNOMED CT codes.",
  "labels": {
    "GSVT": [426761007, 713422000, 233896004, 233897008, 713422000],
    "AFIB": [164889003, 164890007],
    "SR": [426783006, 427393009],
    "SB": [426177001]
  },
  "split_ratio_comment": "Ratio for the train-test-validation split. The first value is the ratio for the training data, the second value is the ratio for the test data, the third value is the ratio for the validation data.",
  "split_ratio": [0.8, 0.1, 0.1]
}