# HSMA Data Science and Analytics SS2024

This project was developed as part of the Data Science and Analytics course at the Mannheim University of Applied Sciences. The lectures covered the data science cycle in theory, and the project put it into practice.

# Analysis of cardiovascular diseases using ECG data

## Table of Contents

- [About](#about)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Progress](#progress)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgements](#acknowledgements)
- [Contact](#contact)

## About

Cardiovascular diseases are a group of diseases that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, which makes effective prevention and management critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms and CT or MRI scans are used to diagnose cardiovascular disease.
(source: https://www.netdoktor.de/krankheiten/herzkrankheiten/, last visit: 15.05.2024)

An electrocardiogram (ECG) records the electrical activity of the heart over a period of time. As an important diagnostic technique in cardiology, it is used to detect cardiac arrhythmias, heart attacks and other cardiovascular diseases. The ECG displays this electrical activity as waves and lines on paper or on a screen. In current screening and diagnostic practice, cardiologists or other physicians review the ECG data, determine the diagnosis and begin treatment such as medication regimens or radiofrequency catheter ablation.
(source: https://flexikon.doccheck.com/de/Elektrokardiogramm, last visit: 15.05.2024)

The project uses a dataset from a 12-lead electrocardiogram database published in August 2022. The database was developed under the auspices of Chapman University, Shaoxing People's Hospital and Ningbo First Hospital to support research on arrhythmias and other cardiovascular diseases. The dataset contains detailed data from 45,152 patients, recorded at a sampling rate of 500 Hz, and includes several common rhythms as well as additional cardiovascular conditions. The diagnoses are divided into four main categories: SB (sinus bradycardia), AFIB (atrial fibrillation and atrial flutter), GSVT (supraventricular tachycardia) and SR (sinus rhythm and sinus irregularities). The ECG data was stored in the GE MUSE ECG system and exported to XML files. A conversion tool was developed to convert the data to CSV format, which was later converted to WFDB format.
(source: https://doi.org/10.13026/wgex-er52, last visit: 15.05.2024)

The grouping into SB, AFIB, GSVT and SR used in this project follows the paper “Optimal Multi-Stage Arrhythmia Classification Approach” by Jianwei Zheng, Huimin Chu et al., which in turn based this choice on the expert opinions of 11 physicians. Each group comprises cardiac rhythms that can be identified by electrocardiographic (ECG) features.
(source: https://rdcu.be/dH2jI, last visit: 15.05.2024)

The steps involved in providing the data are shown in the following diagram.

![Project workflow diagram](readme_data/Projektablauf.drawio.png)

## Getting Started

This project is implemented in Python. Before using it, install all packages listed in the `requirements.txt` file. Then proceed as follows:

1. Ensure you have 10 GB of free disk space.
2. Download the dataset from https://doi.org/10.13026/wgex-er52 (last visit: 15.05.2024).
3. Extract the data.
4. Open the `generate_data.py` script and adjust the `project_dir` path so that it points to the extracted data.
5. Run the `generate_data.py` script as the main file. This generates several pickle files and may take some time.
6. You can now use the notebooks. In the first lines of each notebook, adjust the `path` variable so that it points to the pickle files.

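
As a quick check that step 5 produced usable output, the generated files can be loaded outside the notebooks. The following is only a minimal sketch: the helper `load_pickles`, the `*.pkl` file extension and the example directory `data` are assumptions for illustration and are not taken from `generate_data.py`.

```python
import pickle
from pathlib import Path


def load_pickles(directory):
    """Load every *.pkl file in `directory` into a dict keyed by file stem.

    The .pkl extension is an assumption; adjust the pattern if the
    generated files are named differently.
    """
    data = {}
    for file in sorted(Path(directory).glob("*.pkl")):
        with file.open("rb") as handle:
            data[file.stem] = pickle.load(handle)
    return data


if __name__ == "__main__":
    # Hypothetical location; use the directory that generate_data.py wrote to.
    path = Path("data")
    for name, obj in load_pickles(path).items():
        print(name, type(obj).__name__)
```

Only unpickle files that you generated yourself, since loading a pickle can execute arbitrary code.
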
## Usage

- Coming at the end of the project...

## Progress

- Data was searched for and found at https://doi.org/10.13026/wgex-er52 (last visit: 15.05.2024)
- Data was cleaned
- Demographic data was plotted
- Hypotheses were put forward
- Exploration of signal processing was started

### Demographic plots

#### Histogram

The following histogram shows the age distribution. It breaks down the grouped diagnoses by age group and shows the absolute frequencies of the diagnoses.

The exact procedure for creating the histogram can be found in the notebook [demographic_plots.ipynb](notebooks/demographic_plots.ipynb).

![Histogram of the age distribution by diagnosis group](readme_data/Histogramm.png)

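
The counts behind such a plot can be sketched in a few lines. This is an illustrative example only and not the code from the notebook: the function names, the decade labels (for example `60-69`) and the sample records are assumptions, and the notebook may bin ages differently.

```python
from collections import Counter

DIAGNOSES = ("SB", "AFIB", "GSVT", "SR")


def age_group(age):
    """Map an age in years to a decade label such as '60-69'."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"


def count_by_age_group(records):
    """Count (age group, diagnosis) pairs for an iterable of (age, diagnosis)."""
    return Counter((age_group(age), diagnosis) for age, diagnosis in records)


if __name__ == "__main__":
    # Made-up records for illustration only, not taken from the dataset.
    records = [(64, "SB"), (67, "SB"), (61, "AFIB"), (35, "SR"), (72, "GSVT")]
    counts = count_by_age_group(records)
    for group in sorted({g for g, _ in counts}):
        row = {d: counts[(group, d)] for d in DIAGNOSES}
        print(group, row)
```

Each row of the printed table corresponds to one age group and each column to one diagnosis group, which is the same layout a stacked histogram or a heat map would visualise.
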
#### Correlation matrix

The following figure shows a correlation matrix of age groups and diagnoses. The horizontal axis shows the four diagnosis groups, and the vertical axis shows the age groups in decade increments.

The colour scale represents the correlation between the two categorizations:

- Blue (low)
- Red (high)

The exact procedure for creating the matrix can be found in the notebook [demographic_plots.ipynb](notebooks/demographic_plots.ipynb).

![Correlation matrix of age groups and diagnosis groups](readme_data/Korrelationsmatrix.png)

### Hypotheses

The following two hypotheses were formulated in this project:

- Using ECG data, a classifier can classify the four diagnosis groups with an accuracy of 80%.
- Sinus bradycardia occurs significantly more frequently in the 60 to 70 age group than in other age groups.

The second hypothesis was tested for significance using the chi-square test. The detailed procedure can be found in the following notebook: [statistics.ipynb](notebooks/statistics.ipynb)

Result:

- The first value returned is the chi-square statistic, which measures the difference between the observed and the expected frequencies. A larger value indicates a larger difference. The p-value is the probability of observing a difference at least this large if there were in fact no association. If the p-value is below the significance level of 0.05, the difference is considered statistically significant.
- For this hypothesis, the chi-square statistic measures how much the frequency of sinus bradycardia in the 60-70 age group differs from its frequency in the other age groups. If the p-value is smaller than the significance level of 0.05, this difference is significant.

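
To make the two returned values concrete, the test can be written out by hand for a 2x2 table (age group 60-70 versus all other age groups, sinus bradycardia versus any other diagnosis). This is a sketch under stated assumptions, not the code from `statistics.ipynb`: the function name `chi_square_2x2` and the counts are hypothetical, and no continuity correction is applied.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table

        [[a, b],
         [c, d]]

    Returns (statistic, p_value) for one degree of freedom.
    Assumes every row total and column total is greater than zero.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if age group and diagnosis were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up counts for illustration only, not results from the dataset.
    # Rows: age group 60-70, all other age groups.
    # Columns: sinus bradycardia, any other diagnosis.
    statistic, p_value = chi_square_2x2(30, 70, 20, 80)
    print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
    print("significant" if p_value < 0.05 else "not significant")
```

If the notebook relies on SciPy's `chi2_contingency`, its numbers can differ slightly from this sketch, because that function applies Yates' continuity correction to 2x2 tables by default.
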

The significantly higher frequency of sinus bradycardia in the 60-70 age group could have several causes.
Physiological aging could play a major role. The sinus node continuously generates electrical impulses and thereby sets the normal rhythm and rate of a healthy heart. With increasing age, the sinus node becomes less responsive, which leads to a slower heart rate of below 60 bpm.
Another reason could be that medication use increases with age. Sinus bradycardia can occur as a side effect of that medication.

Why does sinus bradycardia appear more frequently in the 60-70 age group than in the older age groups?
The lower number of sinus bradycardia cases in older age groups could be due to mortality increasing with age. People with sinus bradycardia might not reach older ages because of comorbidities and further complications.
In addition, older people are more likely to receive medical support such as medication and pacemakers, which prevents sinus bradycardia or at least reduces its effect.
The sample size of the underlying study may also affect the significance of the observed frequency.

## Contributing

- Coming at the end of the project...

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgements

We would especially like to thank our instructor, Ms. Jacqueline Franßen, for her enthusiastic support in helping us realize this project.

## Contact

- Klara Tabea Bracke (3015256@hs-mannheim.de)
- Arman Ulusoy (3016148@stud.hs-mannheim.de)
- Nils Rekus (1826514@stud.hs-mannheim.de)
- Felix Jan Michael Mucha (felixjanmichael.mucha@stud.hs-mannheim.de)