main
klara 2024-06-26 15:53:24 +02:00
parent ee9d66e3b4
commit 007c121edb
1 changed file with 54 additions and 38 deletions


@@ -16,6 +16,7 @@ This project was developed through the Data Science and Analytics course at the
## About
(version 12.06)
Cardiovascular diseases are a group of disorders that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, making effective prevention and management critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms, and CT or MRI scans are used to diagnose cardiovascular disease.
(source: https://www.netdoktor.de/krankheiten/herzkrankheiten/, last visited: 15.05.2024)
@@ -41,13 +42,16 @@ The data provision provides for the following points, which can be taken from th
## Getting Started
(version 12.06)
This project is implemented in Python. Follow these steps to set up and use the project:
### Prerequisites
(version 12.06)
- Ensure you have Python 3.8 or newer installed on your system.
- At least `10 GB` of available disk space and `32 GB` of RAM are recommended for optimal performance.
### Installation
(version 12.06)
1. **Download the Dataset:**
- Visit [the dataset page](https://doi.org/10.13026/wgex-er52) (last visited: 15.05.2024) and download the dataset.
- Extract the dataset to a known directory on your system.
@@ -61,6 +65,7 @@ This project is implemented in Python. Follow these steps to set up and use the
- Adjust the parameters as needed, especially the path variables, to match where you extracted the dataset (see the illustrative sketch below).
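The variable names below are purely illustrative assumptions; the actual parameter names are defined in the project's scripts:

```python
# Illustrative only: the actual path variables and their names live in the project's scripts.
DATA_PATH = "/path/to/extracted/dataset"   # directory the downloaded dataset was extracted to
OUTPUT_PATH = "/path/to/generated/output"  # directory where generated files (e.g. pickles) are written
```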
### Generating Data
(version 12.06)
1. **Generate Basic Data Files:**
- In the terminal, ensure you are in the project directory.
- Run the `main` function of `generate_data.py` with the parameters `gen_data=True` and `gen_features=False` to generate several pickle files, as sketched below. This process may take some time.
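A minimal sketch of this call, assuming the `main` function of `generate_data.py` accepts these keyword arguments directly:

```python
# Sketch only: the exact signature of main() in generate_data.py may differ.
from generate_data import main

# First pass: generate the basic pickle files without feature extraction.
main(gen_data=True, gen_features=False)
```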
@@ -115,9 +120,12 @@ Through this process, Emma was able to leverage our project to generate meaningf
- **[Noise reduction](#noise-reduction)**
- **[Features](#features)**
- **[ML-models](#ml-models)**
- **[Cluster analysis](#cluster-analysis)**
- **[Legal basis](#legal-basis)**
### Data cleaning
(version 12.06)
The following criteria were checked to ensure data quality:
@@ -127,6 +135,7 @@ The following criteria were checked to ensure data quality:
- Number of data records that could not be read in (see the sketch below)
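As an illustration of the last check, a minimal sketch for counting records that cannot be read, assuming WFDB-format records and the `wfdb` package (the project may read the data differently):

```python
# Count records that could not be read in; `record_paths` is an assumed list of record names.
import wfdb

unreadable = 0
for record_path in record_paths:
    try:
        wfdb.rdrecord(record_path)  # raises if the record is missing or corrupted
    except Exception:
        unreadable += 1
print(f"Records that could not be read in: {unreadable}")
```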
### Demographic plots
(version 12.06)
#### Histogram
The following histogram shows the age distribution. It illustrates the breakdown of the grouped diagnoses by age group as well as the absolute frequencies of the diagnoses.
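A sketch of how such a plot could be produced, assuming a DataFrame `df` with hypothetical columns `age` and `diagnosis_group` (not necessarily how the project's figure was generated):

```python
# Stacked bar chart of grouped diagnoses per 10-year age group.
import matplotlib.pyplot as plt
import pandas as pd

df["age_group"] = pd.cut(df["age"], bins=range(0, 101, 10), right=False)
counts = df.groupby(["age_group", "diagnosis_group"], observed=False).size().unstack(fill_value=0)

counts.plot(kind="bar", stacked=True)
plt.xlabel("Age group")
plt.ylabel("Absolute frequency")
plt.title("Age distribution of grouped diagnoses")
plt.tight_layout()
plt.show()
```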
@@ -161,9 +170,9 @@ The following two hypotheses were applied in this project:
1. Using ECG data, a classifier can classify the four diagnostic groupings with an accuracy of at least 80%.
Result:
- For the first hypothesis, an accuracy of 83% was achieved with the XGBoost classifier. The detailed procedure can be found in the following notebook: [ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb) (version 12.06)
- An accuracy of 82% was also achieved with a Gradient Boosting Tree classifier. The detailed procedure can be found in the following notebook: [ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (version 12.06)
- An accuracy of 80% was achieved with a Decision Tree classifier. The detailed procedure can be found in the following notebook: [ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (version 03.07)
With these classifiers, the hypothesis that a classifier can classify the diagnostic groups with an accuracy of at least 80% is confirmed.
@@ -172,7 +181,7 @@ With those Classifiers, the hypothesis can be proven, that a classifier is able
2. Sinus bradycardia occurs significantly more frequently in the 60 to 70 age group than in other age groups.
The second hypothesis was tested for significance using the chi-square test. The detailed procedure can be found in the following notebook: [statistics.ipynb](notebooks/statistics.ipynb) (version 12.06)
Result:
@@ -195,11 +204,13 @@ With those Classifiers, the hypothesis can be proven, that a classifier is able
The sample size in the study conducted may also play a role in the significance of the frequency.
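A minimal sketch of such a chi-square test with `scipy.stats`; the contingency table below uses placeholder counts, not the project's actual numbers:

```python
# 2x2 contingency table: sinus bradycardia vs. other diagnoses, inside vs. outside age 60-70.
from scipy.stats import chi2_contingency

contingency = [
    [900, 1400],   # age 60-70: [sinus bradycardia, other diagnoses]  (placeholder counts)
    [2100, 7600],  # all other age groups
]

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant association
```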
### Noise reduction
(version 12.06)
Noise suppression was performed on the existing ECG data. A three-stage noise reduction was implemented to reduce the noise in the ECG signals: first, a Butterworth filter was applied to remove the high-frequency noise; then a Loess filter was applied to remove the low-frequency noise; finally, a non-local means filter was applied to remove the remaining noise. For processing all of the data, however, the built-in noise reduction function `ecg_clean` from NeuroKit2 was used, for reasons of runtime performance.
How the noise reduction was performed in detail can be seen in the following notebook: [noise_reduction.ipynb](notebooks/noise_reduction.ipynb)
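A minimal sketch of this cleaning step; the sampling rate of 500 Hz is an assumption and must match the dataset:

```python
# Clean a single raw ECG lead with NeuroKit2's built-in method.
import neurokit2 as nk

cleaned = nk.ecg_clean(raw_signal, sampling_rate=500)  # default "neurokit" cleaning method
```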
### Features
(version 12.06)
The detection ability of the NeuroKit2 library is tested by using it to detect features in the ECG dataset. These features are important for training the models to distinguish the different diagnostic groups.
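A minimal sketch of feature detection with NeuroKit2, assuming a cleaned single-lead signal and a sampling rate of 500 Hz; the features actually used for training are listed below:

```python
# Detect R-peaks and related ECG features from a cleaned signal.
import neurokit2 as nk

signals, info = nk.ecg_process(cleaned, sampling_rate=500)
r_peaks = info["ECG_R_Peaks"]           # sample indices of detected R-peaks
mean_rate = signals["ECG_Rate"].mean()  # average heart rate over the recording
```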
For the training, the features considered are:
@@ -222,12 +233,37 @@ The exact process can be found in the notebook: [features_detection.ipynb](noteb
For machine learning, the initial step involved tailoring the features for the models, followed by a grid search to identify the best hyperparameters. The highest performance was achieved by the Extreme Gradient Boosting (XGBoost) model, which attained an accuracy of 83%. A Gradient Boosting Tree model was evaluated with the same procedure and achieved an accuracy of 82%, while a Decision Tree model had the lowest performance at 80%. The selection of these models was influenced by the team's own experience and the performance metrics highlighted in the paper (source: https://rdcu.be/dH2jI, last accessed: 15.05.2024). Evaluating the models also shows that some features, such as the ventricular rate, are more important than others. A sketch of the grid-search step is shown after the notebook links below.
<br>The detailed procedures can be found in the following notebooks:
<br>[ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb) (version 12.06)
<br>[ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb) (version 12.06)
<br>[ml_decision_tree.ipynb](notebooks/ml_decision_tree.ipynb) (version 03.07)
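A sketch of the grid-search step for the XGBoost model, assuming a feature matrix `X` and integer-encoded labels `y` for the four diagnostic groups; the parameter grid is illustrative, not the project's exact grid:

```python
# Hyperparameter search for XGBoost via cross-validated grid search.
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

grid = GridSearchCV(
    estimator=XGBClassifier(eval_metric="mlogloss"),
    param_grid={
        "max_depth": [3, 5, 7],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [200, 400],
    },
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", grid.score(X_test, y_test))
```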
### Cluster analysis
(version 03.07)
[](notebooks/...)
### Legal basis
(version 03.07)
- All of the data used come from a single hospital
- Most of the data come from older people, predominantly from the 60-70 age group
## Conclusion
(version 03.07)
- Machine learning and data analysis are valuable tools for investigating cardiovascular diseases
- Predictive modeling can improve diagnostics and treatment
## Outlook
(version 03.07)
- Apply and improve further models in the future
-
-
## Contributing
(version 12.06)
Thank you for your interest in contributing to our project! As an open-source project, we welcome contributions from everyone. Here are some ways you can contribute:
- **Reporting Bugs:** If you find a bug, please open an issue on our GitHub page with a detailed description of the bug, steps to reproduce it, and any other relevant information that could help us fix it.
@@ -246,26 +282,6 @@ Please note, by contributing to this project, you agree that your contributions
We look forward to your contributions. Thank you for helping us improve this project!
## Legal basis (03.07)
- The data used all come from one hospital
- Most of the data are from people of older age, predominantly from the 60-70 age group
## What was expanded? (03.07)
- In addition to the Gradient Tree and Extreme Gradient Boosting models, the Decision Tree model was used, which is explained in more detail in the ["ML models"](#ml-models) section
- Cluster analysis
- Graphic (Nils?)
## Conclusion (03.07)
- Machine learning and data analysis as valuable tools for investigating cardiovascular diseases
- Improvement of diagnostics and treatment possible through predictive modeling
## Outlook into the future (03.07)
- Apply and improve further models in the future
-
-
## License
This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).