completed repo setup

main
Felix Jan Michael Mucha 2024-05-15 20:20:01 +02:00
parent 6b383a5276
commit e0a09ae1b5
8 changed files with 413 additions and 329 deletions

21
LICENSE.txt 100644

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 Klara Tabea Bracke, Arman Ulusoy, Nils Rekus, Felix Jan Michael Mucha
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@ -1,7 +1,9 @@
# HSMA Data Science and Analytics SS2024
## ECG Data classifier and segmentation
[This project aims to classify ... and segment ... ECG data]
This project was developed as part of the Data Science and Analytics course at the Mannheim University of Applied Sciences. The course taught the data science cycle in lectures, and this project implements it in practice.
## Analysis of cardiovascular diseases using ECG data
## Table of Contents
- [About](#about)
@ -10,39 +12,64 @@
- [Progress](#progress)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Contact](#contact)
## About
[Provide a brief overview of the project, including its purpose and any relevant background information.]
Cardiovascular diseases refer to a group of diseases that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, making effective prevention and management of these diseases critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms and CT or MRI scans are used to diagnose cardiovascular disease.
(source: https://www.netdoktor.de/krankheiten/herzkrankheiten/, last visit: 15.05.2024)
An electrocardiogram (ECG) is a method of recording the electrical activity of the heart over a certain period of time. As an important diagnostic technique in cardiology, it is used to detect cardiac arrhythmias, heart attacks and other cardiovascular diseases. The ECG displays this electrical activity as waves and lines on a paper or screen. According to current screening and diagnostic practices, either cardiologists or physicians review the ECG data, determine the correct diagnosis and begin implementing subsequent treatment plans such as medication regimens and radiofrequency catheter ablation.
(source: https://flexikon.doccheck.com/de/Elektrokardiogramm, last visit: 15.05.2024)
The project uses a dataset from a 12-lead electrocardiogram database published in August 2022. The database was developed under the auspices of Chapman University, Shaoxing People's Hospital and Ningbo First Hospital to support research on arrhythmias and other cardiovascular diseases. The dataset contains detailed data from 45,152 patients, recorded at a sampling rate of 500 Hz, and includes several common rhythms as well as additional cardiovascular conditions. The diagnoses are divided into four main categories: SB (sinus bradycardia), AFIB (atrial fibrillation and atrial flutter), GSVT (supraventricular tachycardia) and SR (sinus rhythm and sinus irregularities). The ECG data was stored in the GE MUSE ECG system and exported to XML files. A conversion tool was developed to convert the data to CSV format, which was later converted to WFDB format.
(source: https://doi.org/10.13026/wgex-er52, last visit: 15.05.2024)
The dataset used in this project was divided into four main groups: SB, AFIB, GSVT and SR. The choice of these groups follows the paper “Optimal Multi-Stage Arrhythmia Classification Approach” by Jianwei Zheng, Huimin Chu et al., which in turn bases the grouping on the expert opinions of 11 physicians. Each group represents different cardiac arrhythmias that can be identified by electrocardiographic (ECG) features.
(source: https://rdcu.be/dH2jI, last visit: 15.05.2024)
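The grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's exact implementation: the SNOMED CT codes are the ones listed in `generate_data.py` in this repository, while the helper names `parse_dx_comment` and `assign_category` are hypothetical and chosen only for this example. A record is assigned to the first group that contains at least one of its diagnosis codes.

```python
# SNOMED CT codes per group, as listed in generate_data.py of this repository.
CATEGORIES = {
    "SB": [426177001],
    "AFIB": [164889003, 164890007],
    "GSVT": [426761007, 713422000, 233896004, 233897008],
    "SR": [426783006, 427393009],
}


def parse_dx_comment(comment):
    """Turn a header comment such as 'Dx: 426177001,164934002' into a list of ints.

    Hypothetical helper name, used only in this sketch.
    """
    codes = comment.replace("Dx: ", "")
    return [int(code.strip()) for code in codes.split(",")]


def assign_category(codes):
    """Return the first group sharing at least one code with `codes`, else None.

    Hypothetical helper name, used only in this sketch.
    """
    for name, category_codes in CATEGORIES.items():
        if any(code in category_codes for code in codes):
            return name
    return None


if __name__ == "__main__":
    dx = parse_dx_comment("Dx: 426177001,164934002")
    print(assign_category(dx))
```

Matching on any shared code keeps more records than requiring every code of a record to fall inside one group; the stricter variant would replace `any(...)` with a subset test.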
The data provisioning comprises the steps shown in the following diagram.
![Flow diagram of the data provisioning steps](readme_data/flow_diag.png)
## Getting Started
[Instructions on how to get the project up and running on a local machine. Include prerequisites, installation steps, and any other necessary setup.]
This project was implemented in Python. To use the project, all packages listed in the requirements.txt file need to be installed first. After that, you can interact with the project as follows:
### Prerequisites
[List any software or tools that need to be installed before running the project.]
### Installation
[Step-by-step guide on how to install the project.]
1. Ensure you have 10 GB of available disk space.
2. Download the dataset from https://doi.org/10.13026/wgex-er52 (last visit: 15.05.2024).
3. Extract the data.
4. Open the generate_data.py script and adjust the "project_dir" path to point to the downloaded data.
5. Run the generate_data.py script as the main file. This will generate several pickle files, which may take some time.
6. You can now use the notebooks by adjusting the "path" variable in the top lines of each notebook to point to the pickle files.
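
As a rough sketch of step 6, the pickle files can be loaded like this. It assumes that `generate_data.py` has written one file per group (`SB.pkl`, `AFIB.pkl`, `GSVT.pkl`, `SR.pkl`) into the directory that `path` points to; the function name `load_category_pickles` is hypothetical and not part of the repository. Because the pickles contain WFDB record objects, the `wfdb` package must be installed for unpickling the real files to succeed.

```python
import os
import pickle


def load_category_pickles(path, names=("SB", "AFIB", "GSVT", "SR")):
    """Load one pickle file per group from `path` and return them as a dict.

    Hypothetical helper for illustration; file names follow the
    '<group>.pkl' pattern used by generate_data.py.
    """
    data = {}
    for name in names:
        file_path = os.path.join(path, f"{name}.pkl")
        with open(file_path, "rb") as f:
            data[name] = pickle.load(f)
    return data


if __name__ == "__main__":
    # Assumption: the script was run from the project root and wrote to ./data
    path = "./data"
    if os.path.isdir(path):
        records = load_category_pickles(path)
        for name, items in records.items():
            print(f"{name}: {len(items)} records")
```

Only unpickle files you generated yourself, since `pickle.load` can execute arbitrary code from untrusted files.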
## Usage
[Provide examples and instructions for using the project. Include any relevant code snippets or screenshots.]
- coming at the end of the Project...
## Progress
- Data was searched and found at: https://doi.org/10.13026/wgex-er52
- Data was cleaned:
- Docker container with MongoDB, because of the 10 GB size and the many arrays
- Diagnosis converted from string to list
- Data filtered to contain only healthy records and the needed diagnoses
- Data was searched and found at: https://doi.org/10.13026/wgex-er52 (last visit: 15.05.2024)
- Data was cleaned
- Demographic data was plotted
- Start exploring signal processing
## Contributing
[Explain how others can contribute to the project. This might include guidelines for reporting bugs, submitting enhancements, or proposing new features.]
- coming at the end of the Project...
## License
[Specify the license under which the project is distributed. Include any additional terms or conditions.]
This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgements
[Optional section to thank individuals or organizations that have contributed to the project.]
We would like to especially thank our instructor, Ms. Jacqueline Franßen, for her enthusiastic support in helping us realize this project.
## Contact
[Provide contact information for inquiries or support.]
- Klara Tabea Bracke
- Arman Ulusoy
- Nils Rekus
- Felix Jan Michael Mucha (felixjanmichael.mucha@stud.hs-mannheim.de)


@ -1,9 +1,44 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Quality\n",
"\n",
"This notebook is used to ensure data quality for further evaluations. For this reason, it is examined how much of the data is incomplete. It is important that this only affects a small part of the data in order to avoid any distortion of the data in further analyses."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pickle"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"path = \"C:/Studium/dsa/data\"\n",
"#path = \"C:/Users/Nils/Documents/HS-Mannheim/0000_MASTER/DSA/EKG_Prog/data\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
@ -21,13 +56,6 @@
}
],
"source": [
"import pickle\n",
"from matplotlib import pyplot as plt\n",
"import wfdb\n",
"# read pickle files and check len and print first record and first record keys\n",
"\n",
"path = \"C:/Studium/dsa/data\"\n",
"#path = \"C:/Users/Nils/Documents/HS-Mannheim/0000_MASTER/DSA/EKG_Prog/data\"\n",
"\n",
"categories_dict = {\n",
"'SB': [426177001],\n",
@ -50,12 +78,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check for missing data in timeseries"
"## Check for missing data"
]
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 4,
"metadata": {},
"outputs": [
{
@ -63,82 +91,6 @@
"output_type": "stream",
"text": [
"First record in SB: ['Age: 59', 'Sex: Female', 'Dx: 426177001,164934002', 'Rx: Unknown', 'Hx: Unknown', 'Sx: Unknown']\n",
"Missing sex in JS34080 and comments: Sex: Unknown\n",
"Missing age in JS12543 and comments: Age: NaN\n",
"Missing age in JS12571 and comments: Age: NaN\n",
"Missing age in JS12576 and comments: Age: NaN\n",
"Missing sex in JS12576 and comments: Sex: Unknown\n",
"Missing age in JS12609 and comments: Age: NaN\n",
"Missing sex in JS12609 and comments: Sex: Unknown\n",
"Missing age in JS13024 and comments: Age: NaN\n",
"Missing sex in JS13024 and comments: Sex: Unknown\n",
"Missing age in JS13504 and comments: Age: NaN\n",
"Missing age in JS13505 and comments: Age: NaN\n",
"Missing age in JS13575 and comments: Age: NaN\n",
"Missing age in JS13583 and comments: Age: NaN\n",
"Missing age in JS13645 and comments: Age: NaN\n",
"Missing age in JS13646 and comments: Age: NaN\n",
"Missing age in JS13647 and comments: Age: NaN\n",
"Missing age in JS14027 and comments: Age: NaN\n",
"Missing age in JS14050 and comments: Age: NaN\n",
"Missing age in JS14498 and comments: Age: NaN\n",
"Missing age in JS14555 and comments: Age: NaN\n",
"Missing age in JS14995 and comments: Age: NaN\n",
"Missing sex in JS14995 and comments: Sex: Unknown\n",
"Missing age in JS18505 and comments: Age: NaN\n",
"Missing age in JS18506 and comments: Age: NaN\n",
"Missing age in JS18507 and comments: Age: NaN\n",
"Missing age in JS18508 and comments: Age: NaN\n",
"Missing age in JS18509 and comments: Age: NaN\n",
"Missing age in JS18510 and comments: Age: NaN\n",
"Missing age in JS18511 and comments: Age: NaN\n",
"Missing age in JS18512 and comments: Age: NaN\n",
"Missing age in JS18513 and comments: Age: NaN\n",
"Missing age in JS18514 and comments: Age: NaN\n",
"Missing age in JS18515 and comments: Age: NaN\n",
"Missing sex in JS18515 and comments: Sex: Unknown\n",
"Missing age in JS18574 and comments: Age: NaN\n",
"Missing age in JS19386 and comments: Age: NaN\n",
"Missing age in JS19447 and comments: Age: NaN\n",
"Missing age in JS10867 and comments: Age: NaN\n",
"Missing sex in JS10867 and comments: Sex: Unknown\n",
"Missing age in JS11507 and comments: Age: NaN\n",
"Missing sex in JS11507 and comments: Sex: Unknown\n",
"Missing age in JS22918 and comments: Age: NaN\n",
"Missing sex in JS22918 and comments: Sex: Unknown\n",
"Missing age in JS23063 and comments: Age: NaN\n",
"Missing sex in JS23063 and comments: Sex: Unknown\n",
"Missing age in JS23064 and comments: Age: NaN\n",
"Missing age in JS23787 and comments: Age: NaN\n",
"Missing sex in JS23787 and comments: Sex: Unknown\n",
"Missing age in JS24143 and comments: Age: NaN\n",
"Missing sex in JS24143 and comments: Sex: Unknown\n",
"Missing age in JS24144 and comments: Age: NaN\n",
"Missing sex in JS24144 and comments: Sex: Unknown\n",
"Missing age in JS24145 and comments: Age: NaN\n",
"Missing age in JS45355 and comments: Age: NaN\n",
"Missing age in JS45356 and comments: Age: NaN\n",
"Missing age in JS45357 and comments: Age: NaN\n",
"Missing age in JS45358 and comments: Age: NaN\n",
"Missing age in JS45359 and comments: Age: NaN\n",
"Missing age in JS45360 and comments: Age: NaN\n",
"Missing sex in JS45360 and comments: Sex: Unknown\n",
"Missing age in JS45361 and comments: Age: NaN\n",
"Missing sex in JS45361 and comments: Sex: Unknown\n",
"Missing age in JS45364 and comments: Age: NaN\n",
"Missing age in JS45367 and comments: Age: NaN\n",
"Missing sex in JS45367 and comments: Sex: Unknown\n",
"Missing age in JS45369 and comments: Age: NaN\n",
"Missing age in JS45370 and comments: Age: NaN\n",
"Missing sex in JS45370 and comments: Sex: Unknown\n",
"Missing age in JS45382 and comments: Age: NaN\n",
"Missing sex in JS45382 and comments: Sex: Unknown\n",
"Missing age in JS45383 and comments: Age: NaN\n",
"Missing sex in JS45383 and comments: Sex: Unknown\n",
"Missing age in JS45384 and comments: Age: NaN\n",
"Missing sex in JS45384 and comments: Sex: Unknown\n",
"Missing age in JS45385 and comments: Age: NaN\n",
"Missing sex in JS45385 and comments: Sex: Unknown\n",
"Missing timeseries in 0 records\n",
"Missing age in 55 records\n",
"Missing sex in 21 records\n"
@ -160,11 +112,9 @@
" #if record.comments[2]== '':\n",
" if 'Age: ' not in record.comments[0] or record.comments[0] == 'Age: NaN':\n",
" missing_age.append(record)\n",
" print(f\"Missing age in {record.record_name} and comments: {record.comments[0]}\")\n",
" if record.comments[1] == 'Sex: Unknown' or record.comments[1] == '':\n",
" missing_sex.append(record)\n",
" print(f\"Missing sex in {record.record_name} and comments: {record.comments[1]}\")\n",
" \n",
" \n",
"print(f\"Missing timeseries in {len(missing_timeseries)} records\")\n",
"print(f\"Missing age in {len(missing_age)} records\")\n",
"print(f\"Missing sex in {len(missing_sex)} records\")"

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown.


BIN
requirements.txt 100644

Binary file not shown.


@ -1,165 +1,105 @@
"""
This script reads the WFDB records and extracts the diagnosis information from the comments.
The diagnosis information is then used to classify the records into categories.
The categories are defined by the diagnosis codes in the comments.
The records are then saved to pickle files based on the categories.
"""
import wfdb
import os
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import pickle
#project_dir = "C:/Users/Nils/Documents/0000MASTER/IM1/DSA/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0"
# Directories and file paths
# --------------------------------------------------------------------------------
# Specify the directory where the WFDB records are stored
# NOTE: Specify the directory where the WFDB records are stored
project_dir = 'C:/Users/felix/OneDrive/Studium/Master MDS/1 Semester/DSA/physionet/large_12_ecg_data/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0'
data_dir = project_dir + '/WFDBRecords'
path_diag_lookup = "C:/Users/felix/OneDrive/Studium/Master MDS/1 Semester/DSA/physionet/large_12_ecg_data/a-large-scale-12-lead-electrocardiogram-database-for-arrhythmia-study-1.0.0/ConditionNames_SNOMED-CT.csv"
#path_diag_lookup = project_dir + '/ConditionNames_SNOMED-CT.csv'
#project_dir +'/ConditionNames_SNOMED-CT.csv'
path_diag_lookup = project_dir + "/ConditionNames_SNOMED-CT.csv"
# --------------------------------------------------------------------------------
# Functions
def get_diagnosis_ids(record):
    """
    Extracts diagnosis IDs from a record and returns them as a list.
    Args:
        record (object): The record object containing the diagnosis information.
    Returns:
        list: A list of diagnosis IDs extracted from the record.
    """
    # Get the diagnosis
    diagnosis = record.comments[2]
    # clean the diagnosis
    diagnosis = diagnosis.replace('Dx: ', '')
    list_diagnosis = [int(x.strip()) for x in diagnosis.split(',')]
    return list_diagnosis

def get_diagnosis_name(diagnosis):
    # get the diagnosis name from the lookup table
    name = [diagnosis_lookup[diagnosis_lookup['Snomed_CT'] == x]['Full Name'].to_string(index=False) for x in diagnosis]
    return name

def filter_signal_df_on_diag(df_dict, diagnosis_dict, filter_codes_df):
    # Create a list with filter codes and add 0 for padding
    filter_cod_li = list(filter_codes_df['Snomed_CT']) + [0]
    # Filter the diagnosis dictionary based on the filter codes
    filter_dict_diag = {k: v for k, v in diagnosis_dict.items() if all(i in filter_cod_li for i in v)}
    # Filter the df_dict based on the filtered_dict_diag
    filtered_df_dict = {i: df.loc[df.index.isin(filter_dict_diag.keys())] for i, df in df_dict.items()}
    return filtered_df_dict
# --------------------------------------------------------------------------------
# Explore the data
# Generate the data
# --------------------------------------------------------------------------------
# Read the diagnosis lookup table
diagnosis_lookup = pd.read_csv(path_diag_lookup)
#print(diagnosis_lookup.head())
if __name__ == '__main__':
"""
The following categories are used to classify the records:
# Filter data based on the diagnosis
SB, sinus bradycardia
AFIB, atrial fibrillation and atrial flutter (AFL)
GSVT, supraventricular tachycardia, atrial tachycardia, AV node reentry tachycardia, AV reentry tachycardia, atrial pacemaker
SR, sinus rhythm and sinus irregularities
"""
categories = {
'SB': [426177001],
'AFIB': [164889003, 164890007],
'GSVT': [426761007, 713422000, 233896004, 233897008, 713422000],
'SR': [426783006, 427393009]
}
diag_dict = {k: [] for k in categories.keys()}
# ----------------------------------------------
"""
SB, sinus bradycardia
AFIB, atrial fibrillation and atrial flutter (AFL)
GSVT, supraventricular tachycardia, atrial tachycardia, AV node reentry tachycardia, AV reentry tachycardia, atrial pacemaker
SR, sinus rhythm and sinus irregularities
(atrial pacemaker = 713422000)
"""
categories = {
'SB': [426177001],
'AFIB': [164889003, 164890007],
'GSVT': [426761007, 713422000, 233896004, 233897008, 713422000],
'SR': [426783006, 427393009]
}
#diag_dict = {k: 0 for k in categories.keys()}
diag_dict = {k: [] for k in categories.keys()}
# Create a counter for the number of records
counter = 0
max_counter = 100#100_000
# Loop through the records
for dir_th in os.listdir(data_dir):
path_to_1000_records = data_dir + '/' + dir_th
for dir_hd in os.listdir(path_to_1000_records):
path_to_100_records = path_to_1000_records + '/' + dir_hd
for record_name in os.listdir(path_to_100_records):
# check if .hea is in the record_name
if '.hea' not in record_name:
continue
# Remove the .hea extension from record_name
record_name = record_name.replace('.hea', '')
try:
# Read the record
record = wfdb.rdrecord(path_to_100_records + '/' + record_name)
# Get the diagnosis
diagnosis = np.array(get_diagnosis_ids(record))
# check if diagnosis is a subset of one of the categories
for category_name, category_codes in categories.items():
#if set(diagnosis).issubset(set(category_codes)):
# if any of the diagnosis codes is in the category_codes
if any(i in category_codes for i in diagnosis):
# Increment the counter for the category
#diag_dict[category_name] += 1
# Add record to the category
diag_dict[category_name].append(record)
# Create a counter for the number of records
counter = 0
max_counter = 100_000
failed_records = []
# Loop through the records
for dir_th in os.listdir(data_dir):
path_to_1000_records = data_dir + '/' + dir_th
for dir_hd in os.listdir(path_to_1000_records):
path_to_100_records = path_to_1000_records + '/' + dir_hd
for record_name in os.listdir(path_to_100_records):
# check if .hea is in the record_name
if '.hea' not in record_name:
continue
# Remove the .hea extension from record_name
record_name = record_name.replace('.hea', '')
try:
# Read the record
record = wfdb.rdrecord(path_to_100_records + '/' + record_name)
# Get the diagnosis
diagnosis = np.array(get_diagnosis_ids(record))
# check whether the record belongs to one of the categories
for category_name, category_codes in categories.items():
# if any of the diagnosis codes is in the category_codes
if any(i in category_codes for i in diagnosis):
diag_dict[category_name].append(record)
break
# Increment the counter of how many records we have read
counter += 1
counter_bool = counter >= max_counter
# Break the loop if we have read max_counter records
if counter % 100 == 0:
print(f"Read {counter} records")
if counter_bool:
break
# Increment the counter
counter += 1
counter_bool = counter >= max_counter
# Break the loop if we have read max_counter records
if counter % 100 == 0:
print(f"Read {counter} records")
if counter_bool:
break
except Exception as e:
print(f"Failed to read record {record_name}: {e}")
except Exception as e:
failed_records.append(record_name)
print(f"Failed to read record {record_name}: {e}. Sum of failed records: {len(failed_records)}")
if counter_bool:
break
if counter_bool:
break
if counter_bool:
break
"""
if any(i in category_codes for i in diagnosis):
ID: SB, Count: 16559
ID: AFIB, Count: 9839
ID: GSVT, Count: 948
ID: SR, Count: 9720
break
The counter indicates whether at least one diagnosis is in a category
---------------------------------------------------------------------------------------------------------------------
set(diagnosis).issubset(set(category_codes)):
ID: SB, Count: 8909
ID: AFIB, Count: 1905
ID: GSVT, Count: 431
ID: SR, Count: 7299
break
The counter indicates whether all diagnoses are in a category
"""
# for id, count in diag_dict.items():
# print(f"ID: {id}, Count: {count}")
# write to pickle
for cat_name, records in diag_dict.items():
print(f"Writing {cat_name} to pickle with {len(records)} records")
# if path not exists create it
if not os.path.exists('./data'):
os.makedirs('./data')
with open(f'./data/{cat_name}.pkl', 'wb') as f:
pickle.dump(records, f)
# write to pickle
for cat_name, records in diag_dict.items():
print(f"Writing {cat_name} to pickle with {len(records)} records")
# if path not exists create it
if not os.path.exists('./data'):
os.makedirs('./data')
with open(f'./data/{cat_name}.pkl', 'wb') as f:
pickle.dump(records, f)