diff --git a/README.md b/README.md
index c4bd36c..1287043 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ This project was developed through the Data Science and Analytics course at the
## About
-(version 12.06)
+(as of 12.06)
Cardiovascular diseases refer to a group of diseases that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, making effective prevention and management of these diseases critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms and CT or MRI scans are used to diagnose cardiovascular disease.
(source: https://www.netdoktor.de/krankheiten/herzkrankheiten/, last visit: 15.05.2024)
@@ -42,16 +42,16 @@ The data provision provides for the following points, which can be taken from th
## Getting Started
-(version 12.06)
+(as of 12.06)
This project is implemented in Python. Follow these steps to set up and use the project:
### Prerequisites
-(version 12.06)
+(as of 12.06)
- Ensure you have Python 3.8 or newer installed on your system.
- At least `10 GB` of available disk space and `32 GB` of RAM are recommended for optimal performance.
### Installation
-(version 12.06)
+(as of 12.06)
1. **Download the Dataset:**
- Visit [the dataset page](https://doi.org/10.13026/wgex-er52) (last visited: 15.05.2024) and download the dataset.
- Extract the dataset to a known directory on your system.
@@ -65,7 +65,7 @@ This project is implemented in Python. Follow these steps to set up and use the
- Adjust the parameters as needed, especially the path variables to match where you extracted the dataset.
### Generating Data
-(version 12.06)
+(as of 12.06)
1. **Generate Basic Data Files:**
- In the terminal, ensure you are in the project directory.
- Run the `generate_data.py` main function with the following parameters: `gen_data=True`, `gen_features=False`. This generates several pickle files and may take some time.
@@ -114,20 +114,20 @@ Through this process, Emma was able to leverage our project to generate meaningf
## Progress
- **Data was searched and found at: (https://doi.org/10.13026/wgex-er52, last visit: 15.05.2024)**
-- **[Data was cleaned](#data-cleaning)** (version 12.06)
-- **[Demographic data was plotted](#demographic-plots)** (version 12.06)
-- **[Hypotheses put forward](#hypotheses)** (version 12.06 & 03.07)
-- **[Noise reduction](#noise-reduction)** (version 12.06)
-- **[Features](#features)**(version 12.06)
-- **[ML-models](#ml-models)** (version 12.06 & 03.07)
-- **[Cluster analysis](#cluster-analysis)** (version 03.07)
-- **[Legal basis](#legal-basis)** (version 03.07)
-- **[Conclusion](#conclusion)** (version 03.07)
-- **[Outlook](#outlook)** (version 03.07)
+- **[Data was cleaned](#data-cleaning)** (as of 12.06)
+- **[Demographic data was plotted](#demographic-plots)** (as of 12.06)
+- **[Hypotheses put forward](#hypotheses)** (as of 12.06 & 03.07)
+- **[Noise reduction](#noise-reduction)** (as of 12.06)
+- **[Features](#features)** (as of 12.06)
+- **[ML-models](#ml-models)** (as of 12.06 & 03.07)
+- **[Cluster analysis](#cluster-analysis)** (as of 03.07)
+- **[Legal basis](#legal-basis)** (as of 03.07)
+- **[Conclusion](#conclusion)** (as of 03.07)
+- **[Outlook](#outlook)** (as of 03.07)
### Data cleaning
-(version 12.06)
+(as of 12.06)
The following criteria were checked to ensure data quality:
@@ -137,7 +137,7 @@ The following criteria were checked to ensure data quality:
- Number of data records that could not be read in
### Demographic plots
-(version 12.06)
+(as of 12.06)
#### Histogram
The following histogram shows the age distribution. It illustrates the breakdown of the grouped diagnoses by age group as well as the absolute frequencies of the diagnoses.
@@ -165,7 +165,7 @@ The exact procedure for creating the matrix can be found in the notebook [demogr
### Hypotheses
-(version 03.07.)
+(as of 03.07)
The following two hypotheses were applied in this project:
@@ -174,9 +174,9 @@ The following two hypotheses were applied in this project:
1. Using ECG data, a classifier can classify the four diagnostic groupings with an accuracy of at least 80%.
Result:
-- For the first hypothesis, an accuracy of 83 % was achieved with the XGBoost classifier. The detailed procedure can be found in the following notebook: [ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb) (version 12.06)
-- Also a 82 % accuracy was achieved with a Gradient Boosting Tree Classifier. The detailed procedure can be found in the following notebook: [ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (version 12.06)
-- An 80 % accuracy was achieved with a Decision Tree Classifier. The detailed procedure can be found in the following notebook: [ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (version 03.07)
+- For the first hypothesis, an accuracy of 83 % was achieved with the XGBoost classifier. The detailed procedure can be found in the following notebook: [ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb) (as of 12.06)
+- An accuracy of 82 % was also achieved with a Gradient Boosting Tree Classifier. The detailed procedure can be found in the following notebook: [ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (as of 12.06)
+- An 80 % accuracy was achieved with a Decision Tree Classifier. The detailed procedure can be found in the following notebook: [ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (as of 03.07)
These classifiers confirm the hypothesis that a classifier can classify the diagnostic groups with an accuracy of at least 80%.
@@ -212,14 +212,14 @@ The significant appearance of atrial fibrillation/atrial flutter in the age grou
Physiological age is the main reason. With increasing age, various changes occur in the cardiovascular system. Older people are more likely to have hypertension. The increased pressure can thicken the heart walls and change their structure, potentially leading to AFIB. Chronic inflammation, which is more prevalent in older people, can damage heart tissue and lead to atrial issues. Changes in hormone levels with age can also influence heart function and contribute to the development of arrhythmias. Older adults are also more likely to have comorbidities such as diabetes, obesity or chronic kidney disease.
(source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5460064/, last visit: 28.06.2024)
### Noise reduction
-(version 12.06)
+(as of 12.06)
A three-stage noise reduction was applied to the existing ECG signals. First, a Butterworth filter removed the high-frequency noise. Then a Loess filter removed the low-frequency noise. Finally, a non-local means filter removed the remaining noise. For reasons of time performance, the built-in noise reduction function `ecg_clean` from NeuroKit2 was used for all data.
How the noise reduction was performed in detail can be seen in the following notebook: [noise_reduction.ipynb](notebooks/noise_reduction.ipynb)
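The first stage described above can be sketched with SciPy. This is only an illustration of a zero-phase Butterworth low-pass filter on a synthetic signal, not the code from the notebook and not the NeuroKit2 `ecg_clean` call used in the project. The sampling rate, cutoff frequency, filter order and the helper name `butter_lowpass` are assumptions made for the example.

```python
import numpy as np
from scipy.signal import butter, filtfilt


def butter_lowpass(signal, sampling_rate, cutoff_hz=50.0, order=4):
    """Apply a zero-phase Butterworth low-pass filter (hypothetical helper)."""
    nyquist = 0.5 * sampling_rate
    # Normalised cutoff frequency in the range (0, 1) expected by butter().
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # filtfilt runs the filter forwards and backwards, so no phase shift is added.
    return filtfilt(b, a, signal)


# Synthetic stand-in for one ECG lead (assumed: 10 s at 500 Hz).
sampling_rate = 500
t = np.arange(0, 10, 1 / sampling_rate)
clean = np.sin(2 * np.pi * 1.2 * t)             # slow "heartbeat-like" component
noisy = clean + 0.3 * np.sin(2 * np.pi * 120 * t)  # added high-frequency noise

filtered = butter_lowpass(noisy, sampling_rate)

# Root-mean-square error against the clean signal before and after filtering.
error_before = np.sqrt(np.mean((noisy - clean) ** 2))
error_after = np.sqrt(np.mean((filtered - clean) ** 2))
print(f"RMSE before: {error_before:.4f}, after: {error_after:.4f}")
```

The Loess and non-local means stages are not covered by this sketch.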
### Features
-(version 12.06)
+(as of 12.06)
The NeuroKit2 library was used to detect features in the ECG dataset, and its detection ability was tested. These features are important for training the models to distinguish the different diagnostic groups.
For the training, the features considered are:
@@ -242,13 +242,13 @@ The exact process can be found in the notebook: [features_detection.ipynb](noteb
For machine learning, the features were first tailored to the models, and a grid search was then used to identify the best hyperparameters. The Extreme Gradient Boosting (XGBoost) model performed best, with an accuracy of 83%. A Gradient Boosting Tree model evaluated with the same procedure achieved an accuracy of 82%, and a Decision Tree model had the lowest accuracy at 80%. The selection of these models was influenced by the team's own experience and the performance metrics highlighted in the paper (source: https://rdcu.be/dH2jI, last accessed: 15.05.2024). The evaluation also showed that some features, such as the ventricular rate, are more important than others.
The detailed procedures can be found in the following notebooks:
-[ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb)(version 12.06)
-[ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (version 12.06)
-[ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (version 03.07)
+[ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb) (as of 12.06)
+[ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (as of 12.06)
+[ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (as of 03.07)
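The grid search described in this section can be sketched roughly as follows. This is a minimal illustration with scikit-learn, assuming a synthetic four-class dataset in place of the ECG features and a small, made-up parameter grid. It uses a decision tree only to keep the example light; the notebooks may use different parameters and settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature table (assumption: 10 features, 4 classes).
X, y = make_classification(
    n_samples=400,
    n_features=10,
    n_informative=6,
    n_classes=4,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical hyperparameter grid, chosen only for illustration.
param_grid = {"max_depth": [3, 5, 8], "min_samples_leaf": [1, 5, 10]}

# Cross-validated search over every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test split.
test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
# Per-feature importances of the best tree, one value per input feature.
importances = search.best_estimator_.feature_importances_

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.2f}")
```

The same pattern applies to other estimators by swapping the classifier and its grid.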
## Cluster-analysis
-(version 03.07)
+(as of 03.07)
To enhance our understanding of the feature clusters and their similarity to the original data labels, we initiated our analysis by preparing the data. This preparation involved normalization and imputation of missing values with mean substitution. Subsequently, we employed the K-Means algorithm for distance-based clustering of the data. Our exploration focused on comparing two types of labels: those derived from the K-Means algorithm and the original dataset labels. Various visualizations were generated to facilitate this comparison.
@@ -273,7 +273,7 @@ Further analysis included the creation of a Euclidean distance matrix plot to vi
[cluster_features.ipynb](notebooks/cluster_features.ipynb)
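The preparation and clustering steps described in this section can be sketched as follows. This is a minimal example with scikit-learn on synthetic data, not the notebook code. The order of imputation and scaling, the number of clusters, and the use of the adjusted Rand index as a numeric comparison between K-Means labels and original labels are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: four groups of 50 samples with 3 features each.
centers = np.array([[0, 0, 0], [5, 5, 0], [0, 5, 5], [5, 0, 5]], dtype=float)
true_labels = np.repeat(np.arange(4), 50)
features = centers[true_labels] + rng.normal(scale=0.5, size=(200, 3))

# Remove roughly 5 % of the values to mimic missing feature measurements.
missing_mask = rng.random(features.shape) < 0.05
features[missing_mask] = np.nan

# Mean substitution for missing values, then normalisation.
imputed = SimpleImputer(strategy="mean").fit_transform(features)
scaled = StandardScaler().fit_transform(imputed)

# Distance-based clustering with K-Means.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(scaled)

# Agreement between K-Means labels and the original labels.
ari = adjusted_rand_score(true_labels, cluster_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```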
## Legal Basis and Data Biases
-(version 03.07)
+(as of 03.07)
### Local Bias
- The dataset originates from a small number of institutions, with contributions from Chapman University, Shaoxing People’s Hospital (affiliated with Shaoxing Hospital Zhejiang University School of Medicine), and Ningbo First Hospital. This may introduce a local bias, as all data were collected in a specific geographic and institutional context.
@@ -287,12 +287,12 @@ Further analysis included the creation of a Euclidean distance matrix plot to vi
This indicates a potential demographic bias towards older age groups and a gender imbalance.
## Data protection and ethics
-(version 03.07)
+(as of 03.07)
The data used in the project was approved by the review boards of Shaoxing People's Hospital and Ningbo First Hospital of Zhejiang University. Both institutions allowed public disclosure of the data after de-identification. Shaoxing People's Hospital waived the informed consent requirement, and Ningbo First Hospital likewise did not require patient consent.
## Conclusion
-(version 03.07)
+(as of 03.07)
This project has demonstrated the feasibility and benefits of applying modern data analysis methods and machine learning in the field of cardiology. Using a large dataset of 12-lead ECGs, it was possible to classify different cardiac arrhythmias with models such as XGBoost, gradient boosting and decision trees. These models achieved classification accuracies of 80% and above, highlighting the importance of accurate diagnostic tools.
@@ -305,7 +305,7 @@ Ultimately, our research shows that the continued integration and improvement of
## Outlook
-(version 03.07)
+(as of 03.07)
As data science advances, there are several opportunities to improve and expand current research on analyzing cardiovascular disease using ECG data.
Key future directions include:
@@ -329,7 +329,7 @@ Key future directions include:
## Contributing
-(version 12.06)
+(as of 12.06)
Thank you for your interest in contributing to our project! As an open-source project, we welcome contributions from everyone. Here are some ways you can contribute: