README.md updated

2024-06-30 23:23:27 +02:00 · 2024-06-30 23:23:27 +02:00 · 09a24a2866
parent 17b61e83af
commit 09a24a2866
1 changed file with 25 additions and 25 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # HSMA Data Science and Analytics SS2024 

-This project was developed through the Data Science and Analytics course at the Mannheim University of Applied Sciences. A data science cycle was taught theoretically on the basis of lectures and implemented practically in the project. 
+This project was developed as part of the Data Science and Analytics course at the Mannheim University of Applied Sciences. A data science cycle was taught theoretically in lectures and implemented practically in the project. 

 # Analysis of cardiovascular diseases using ECG data

@@ -18,19 +18,19 @@ This project was developed through the Data Science and Analytics course at the
 ## About 
 (as of  12.06)

-Cardiovascular diseases refer to a group of diseases that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, making effective prevention and management of these diseases critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms and CT or MRI scans are used to diagnose cardiovascular disease.
+Cardiovascular diseases are a group of diseases that affect the heart and blood vessels and represent a significant global health burden. They are a leading cause of morbidity and mortality worldwide, making effective prevention and management of these diseases critical. Physical examinations, blood tests, ECGs, stress or exercise tests, echocardiograms and CT or MRI scans are used to diagnose cardiovascular disease.
 (source: https://www.netdoktor.de/krankheiten/herzkrankheiten/, last visit: 15.05.2024)


-An electrocardiogram (ECG) is a method of recording the electrical activity of the heart over a certain period of time. As an important diagnostic technique in cardiology, it is used to detect cardiac arrhythmias, heart attacks and other cardiovascular diseases. The ECG displays this electrical activity as waves and lines on a paper or screen. According to current screening and diagnostic practices, either cardiologists or physicians review the ECG data, determine the correct diagnosis and begin implementing subsequent treatment plans such as medication regimens and radiofrequency catheter ablation.
+An electrocardiogram (ECG) is a method of recording the electrical activity of the heart over a certain period of time. As an important diagnostic technique in cardiology, it is used to detect cardiac arrhythmias, heart attacks and other cardiovascular diseases. The ECG displays this electrical activity as waves and lines on paper or on a screen. According to current screening and diagnostic practices, either cardiologists or physicians review the ECG data, determine the correct diagnosis and begin implementing subsequent treatment plans such as medication regimens and radiofrequency catheter ablation.
 (https://flexikon.doccheck.com/de/Elektrokardiogramm, last visit: 15.05.2024)


-The project uses a dataset from a 12-lead electrocardiogram database published in August 2022. The database was developed under the auspices of Chapman University, Shaoxing People's Hospital and Ningbo First Hospital to support research on arrhythmias and other cardiovascular diseases. The dataset contains detailed data from 45,152 patients, recorded at a sampling rate of 500 Hz, and includes several common rhythms as well as additional cardiovascular conditions. The diagnoses are divided into four main categories: SB (sinus bradycardia), AFIB (atrial fibrillation and atrial flutter), GSVT (supraventricular tachycardia) and SR (sinus rhythm and sinus irregularities). The ECG data was stored in the GE MUSE ECG system and exported to XML files. A conversion tool was developed to convert the data to CSV format, which was later converted to WFDB format. 
+The project uses a dataset from a 12-lead electrocardiogram database published in August 2022. The database was developed under the auspices of Chapman University, Shaoxing People's Hospital and Ningbo First Hospital to support research on arrhythmias and other cardiovascular diseases. The dataset contains detailed data from 45,152 patients, recorded at a sampling rate of 500 Hz, and includes several common rhythms as well as additional cardiovascular conditions. The diagnoses are grouped into four main categories: SB (sinus bradycardia), AFIB (atrial fibrillation and atrial flutter), GSVT (supraventricular tachycardia) and SR (sinus rhythm and sinus irregularities). The ECG data was stored in the GE MUSE ECG system and exported to XML files. A conversion tool was developed to convert the data to CSV format, which was later converted to WFDB format. 
 (source: https://doi.org/10.13026/wgex-er52, last visit: 15.05.2024)


-The data set used in this project was divided into four main groups: SB, AFIB, GSVT and SR. The choice of these groups is based on the results from the paper “Optimal Multi-Stage Arrhythmia Classification Approach” by Jianwei Zheng, Huimin Chu et al., this choice in turn is based on expert opinions from 11 physicians. Each group represents different cardiac arrhythmias that can be identified by electrocardiographic (ECG) features. 
+The dataset used in this project was divided into four main groups: SB, AFIB, GSVT and SR. The choice of these groups is based on the results from the paper “Optimal Multi-Stage Arrhythmia Classification Approach” by Jianwei Zheng, Huimin Chu et al.; this choice in turn is based on expert opinions from 11 physicians. Each group represents different cardiac arrhythmias that can be identified by electrocardiographic (ECG) features. 
 (source: https://rdcu.be/dH2jI, last visit: 15.05.2024)


@@ -86,9 +86,9 @@ Let's walk through a user story to illustrate how to use our project, incorporat
 **Emma**, a health data analyst, is keen on exploring the relationship between ECG signals and health outcomes. She decides to use our project for her analysis. Here's how she proceeds:

 1. **Preparation:**
-   - Emma checks that her computer has at least 10GB of free space and 32GB of RAM.
+   - Emma makes sure that her computer has at least 10GB of free space and 32GB of RAM.
   - She visits the dataset page (https://doi.org/10.13026/wgex-er52, last visited: 15.05.2024) and downloads the dataset.
-   - After downloading, Emma extracts the data to a specific directory on her computer.
+   - After the download, Emma extracts the data to a specific directory on her computer.

 2. **Setting Up:**
   - Emma opens a terminal, navigates to the project directory, and runs `pip install -r requirements.txt` to install the required Python packages.
@@ -96,19 +96,19 @@ Let's walk through a user story to illustrate how to use our project, incorporat

 3. **Generating Data:**
   - To generate the basic data files, Emma makes sure she is in the project directory in the terminal. She first adjusts `generate_data.py` so that it calls the `main` function with `gen_data=True` and `gen_features=False`, and then runs the script. This process generates several pickle files and may take some time.
-   - For generating machine learning features (optional), Emma adjusts the script to call the `main` function with `gen_data=False` and `gen_features=True` to generate a database file `.db`. This also may take some time.
+   - For generating machine learning features (optional), Emma adjusts the script to call the `main` function with `gen_data=False` and `gen_features=True` to generate a database file `.db`. This may also take some time.

 4. **Analysis:**
-   - With the data and features generated, Emma is now ready to dive into the analysis. She opens the provided Jupyter notebooks and can see the demographic plots, methods of feature detection and noise reduction. With the `filter_params.json` file she is also able to adujst paramters to see how it changes the noise reducing.
+   - With the data and features generated, Emma is now ready to dive into the analysis. She opens the provided Jupyter notebooks and can see the demographic plots, methods of feature detection and noise reduction. With the `filter_params.json` file she is also able to adjust parameters to see how they change the noise reduction.

 5. **Deep Dive:**
-   - Interested in the features and the resulting machine learning accurarcies, Emma uses the signal processing notebooks to analyze patterns in the health data.
+   - Interested in the features and the resulting machine learning accuracies, Emma uses the signal processing notebooks to analyze patterns in the health data.
   - She adjusts parameters and runs different analyses, noting interesting trends and correlations.
   - After training her own models, she can also compare her results with the included models in the `ml_models` directory to evaluate the performance of her models. 

 6. **Sharing Insights:**
   - Emma compiles her findings into a report, using plots and insights generated from our project.
-   - She shares her report with her team, highlighting how features like the R Axis can influence health outcomes.
+   - She shares her report with her team, highlighting how features like the R axis can influence health outcomes.

 Through this process, Emma was able to leverage our project to generate meaningful insights into health data, demonstrating the project's utility in real-world analysis.

@@ -132,7 +132,7 @@ Through this process, Emma was able to leverage our project to generate meaningf
 The following criteria were checked to ensure data quality:

 - Number of data records that did not specify gender
- Number of data sets that did not specify an age
+- Number of data records that did not specify an age
 - Number of data records in which the signal length deviates from 5000 (10 seconds * 500 Hz)
 - Number of data records that could not be read in
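The four checks listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dict with `gender`, `age` and `signal` keys) and the use of `None` for a record that could not be read are hypothetical and not taken from the project code.

```python
# Expected signal length from the README: 10 seconds * 500 Hz
EXPECTED_SAMPLES = 10 * 500


def quality_report(records):
    """Count records that fail the data quality criteria.

    `records` is a list of dicts with the hypothetical keys
    'gender', 'age' and 'signal'; `None` stands in for a record
    that could not be read.
    """
    report = {
        "missing_gender": 0,
        "missing_age": 0,
        "wrong_length": 0,
        "unreadable": 0,
    }
    for record in records:
        if record is None:
            report["unreadable"] += 1
            continue
        if not record.get("gender"):
            report["missing_gender"] += 1
        if record.get("age") is None:
            report["missing_age"] += 1
        if len(record.get("signal", [])) != EXPECTED_SAMPLES:
            report["wrong_length"] += 1
    return report


# Made-up example records, for illustration only
example_records = [
    {"gender": "F", "age": 64, "signal": [0.0] * 5000},
    {"gender": None, "age": 71, "signal": [0.0] * 5000},
    {"gender": "M", "age": None, "signal": [0.0] * 4200},
    None,
]
example_report = quality_report(example_records)
```

A single record can fail several checks at once, so the counts do not necessarily add up to the number of affected records.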

@@ -184,7 +184,7 @@ With those Classifiers, the hypothesis can be proven, that a classifier is able

 2. Sinus bradycardia occurs significantly more frequently in the 60 to 70 age group than in other age groups.
 Atrial fibrillation/atrial flutter also occurs significantly more frequently in the 70 to 80 age group than in other age groups.<br>
-      The second hypothesis was tested for significance using the chi-square test. The detailed procedure can be found in the following notebook: [statistics.ipynb](notebooks/statistics.ipynb)
+      The second hypothesis was tested for significance using the Chi-square test. The detailed procedure can be found in the following notebook: [statistics.ipynb](notebooks/statistics.ipynb)

    Results:

@@ -209,7 +209,7 @@ The higher frequency of older people in the database may lead to a slight bias i
 The sample size in the study conducted may also play a role in the significance of the frequency.

 The significant appearance of atrial fibrillation/atrial flutter in the age group 70-80 could be caused by multiple factors.
-The physiological age is the main reason. With increasing age, various age-related changes in the cardiovascular system occur. Older people are more likely to have hypertension. The increased pressure can lead to thickening of the heart walls and a change of the structure, potentially leading to AFIB. Chronic inflammation which is more prevalent in older people, can damage heart tissue and lead to atrial issues. The change of hormone levels when getting older can also have an influence on the heart function and contribute to the development of arrhythmias. Older adults are also more likely to have comorbidities like diabetes, obesity or chronic kidney disease.
+The physiological age is the main reason. With increasing age, various age-related changes in the cardiovascular system occur. Older people are more likely to have hypertension. The increased pressure can lead to thickening of the heart walls and a change of the structure, potentially leading to AFIB. Chronic inflammation, which is more prevalent in older people, can damage heart tissue and lead to atrial issues. The change of hormone levels when getting older can also have an influence on the heart function and contribute to the development of arrhythmias. Older adults are also more likely to have comorbidities such as diabetes, obesity or chronic kidney disease.
 <br>(source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5460064/, last visit: 28.06.2024)
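The kind of chi-square test used for the second hypothesis can be sketched for the simplest case, a 2x2 table (age group of interest vs. all other age groups, diagnosis of interest vs. all other diagnoses). The counts below are made up for illustration and are not the project's data; the actual procedure and results are in the statistics notebook. The sketch uses no continuity correction and relies on the fact that, for one degree of freedom, the chi-square survival function equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns the test statistic and the p-value (1 degree of freedom,
    no continuity correction).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = age group of interest / other age groups,
# columns = diagnosis of interest / other diagnoses
hypothetical_table = [[300, 700], [200, 800]]
statistic, p_value = chi_square_2x2(hypothetical_table)
is_significant = p_value < 0.05
```

For tables with more than two rows or columns, the degrees of freedom change and a library routine for the chi-square distribution would be needed to obtain the p-value.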
 ### Noise reduction 
 (as of  12.06)
@@ -250,20 +250,20 @@ For machine learning, the initial step involved tailoring the features for the m
 ## Cluster-analysis
 (as of  03.07)

-To enhance our understanding of the feature clusters and their similarity to the original data labels, we initiated our analysis by preparing the data. This preparation involved normalization and imputation of missing values with mean substitution. Subsequently, we employed the K-Means algorithm for distance-based clustering of the data. Our exploration focused on comparing two types of labels: those derived from the K-Means algorithm and the original dataset labels. Various visualizations were generated to facilitate this comparison.
+To enhance our understanding of the feature clusters and their similarity to the original data labels, we began our analysis by preparing the data. This preparation included normalization and imputation of missing values with mean substitution. Subsequently, we employed the K-Means algorithm for distance-based clustering of the data. Our exploration focused on comparing two types of labels: those derived from the K-Means algorithm and the original dataset labels. Various visualizations were generated to facilitate this comparison.
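The preparation step described above (mean imputation followed by normalization) can be sketched as follows. The README does not say which normalization was used, so z-score standardization is an assumption here, as is the representation of missing values as `None`; the function name is hypothetical.

```python
from statistics import mean, pstdev


def impute_and_standardize(rows):
    """Fill missing values (None) with the column mean, then z-score each column.

    `rows` is a list of equally long feature rows. Z-score scaling is an
    assumption; the project may use a different normalization.
    """
    prepared_columns = []
    for column in zip(*rows):
        observed = [value for value in column if value is not None]
        column_mean = mean(observed)
        filled = [column_mean if value is None else value for value in column]
        spread = pstdev(filled)
        # A constant column carries no information; map it to zeros
        prepared_columns.append(
            [(value - column_mean) / spread if spread else 0.0 for value in filled]
        )
    # Transpose back to row-major layout
    return [list(row) for row in zip(*prepared_columns)]


# Made-up feature rows with one missing value
features = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, 30.0],
]
prepared = impute_and_standardize(features)
```

Scaling matters for K-Means because the algorithm is distance-based: without it, features with large numeric ranges would dominate the clustering.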

 A dimensionality reduction plot using Principal Component Analysis (PCA) revealed that, although the clusters formed by the K-Means algorithm and the original labels are not highly similar, both exhibit some degree of clustering. To quantitatively assess the quality of the K-Means clusters, we calculated the following metrics:
 - Adjusted Rand Index (ARI): 0.15
 - Normalized Mutual Information (NMI): 0.24
 - Silhouette Score: 0.47

-The ARI and NMI scores indicate that the clustering algorithm has a moderate level of effectiveness in reflecting the structure of the true labels, albeit not with high accuracy. These scores suggest some alignment with the true labels, but the clustering does not perfectly capture the underlying groupings. This suggests that the distances between features, as determined by the clustering algorithm, do not fully mirror the inherent categorizations indicated by the original labels of the data.
+The ARI and NMI scores indicate that the clustering algorithm has a moderate level of effectiveness in reflecting the structure of the true labels, although not with high accuracy. These scores suggest some alignment with the true labels, but the clustering does not perfectly capture the underlying groupings. This suggests that the distances between features, as determined by the clustering algorithm, do not fully reflect the inherent categorizations indicated by the original labels of the data.

-The Silhouette Score suggests that the clusters identified are internally cohesive and distinct from each other, indicating that the clustering algorithm has been somewhat successful in identifying meaningful structures within the data, even if these structures do not align perfectly with the true labels.
+The Silhouette Score suggests that the clusters identified are internally coherent and distinct from each other, indicating that the clustering algorithm has been somewhat successful in identifying meaningful structures within the data, even if these structures do not align perfectly with the true labels.

 Further analysis included the creation of a Euclidean distance matrix plot to visualize patterns of data point separation. This analysis revealed the presence of outliers, as some data points were significantly more distant from others.

- Finally, a parallel axis plot was generated to examine the relationship between data features and the clusters. Notably, this plot highlighted the ventricular rate feature as a significant separator in the original labels, underscoring its importance as identified by our machine learning models in predicting the labels.
+ Finally, a parallel axis plot was generated to examine the relationship between the data features and the clusters. Notably, this plot highlighted the ventricular rate feature as a significant separator in the original labels, underscoring its importance as identified by our machine learning models in predicting the labels.


 ![Alt-Text](readme_data/cluster_analysis.png)
@@ -272,11 +272,11 @@ Further analysis included the creation of a Euclidean distance matrix plot to vi
 <br>The detailed procedures can be found in the following notebook:
 <br>[cluster_features.ipynb](notebooks/cluster_features.ipynb)

-## Legal Basis and Data Biases
+## Data Biases
 (as of  03.07)

 ### Local Bias
- The dataset originates exclusively from one hospital, encompassing contributions from Chapman University, Shaoxing People’s Hospital (affiliated with Shaoxing Hospital Zhejiang University School of Medicine), and Ningbo First Hospital. This may introduce a local bias, as all data are collected from a specific geographic and institutional context.
+- The dataset originates exclusively from a small group of cooperating institutions: Chapman University, Shaoxing People’s Hospital (affiliated with Shaoxing Hospital Zhejiang University School of Medicine), and Ningbo First Hospital. This may introduce a local bias, as all data were collected in a specific geographic and institutional context.

 ### Demographic Bias
 - The dataset predominantly features data from older individuals, with the majority of participants falling within the 60-70 age group. This demographic skew is further detailed by:
@@ -296,9 +296,9 @@ The data used in the project was approved by the review boards of Shaoxing Peopl

 This project has impressively demonstrated the feasibility and benefits of applying modern data analysis methods and machine learning in the field of cardiology. By using a large dataset of 12-lead ECGs, it was possible to effectively classify different cardiac arrhythmias using models such as XGBoost, gradient boosting and decision trees. These models achieved classification accuracies of 80% and above, highlighting the importance of accurate diagnostic tools.

-Despite these successes, we encountered challenges such as the lack of datasets for certain demographic groups and the handling of incomplete ECG recordings. These limitations emphasize the need for further research to improve data collection and processing in medical studies.
+Despite these successes, we encountered challenges such as the lack of datasets for certain demographic groups and the handling of incomplete ECG recordings. These limitations highlight the need for further research to improve data collection and processing in medical studies.

-The application of these analytical techniques not only provides the opportunity to make more accurate and faster diagnoses, but also opens avenues for the development of personalized treatment approaches tailored to specific patient-individual data. 
+The application of these analytical techniques not only provides the opportunity for more accurate and faster diagnoses, but also opens avenues for the development of personalized treatment approaches tailored to individual patient data. 

 Ultimately, our research shows that the continued integration and improvement of technological solutions in medical diagnostic procedures is essential for future healthcare. We recommend continuing research in this direction.

@@ -307,7 +307,7 @@ Ultimately, our research shows that the continued integration and improvement of
 ## Outlook
 (as of  03.07)

-As data science advances, there are several opportunities to improve and expand current research on analyzing cardiovascular disease using ECG data. 
+As data science advances, there are several opportunities to improve and extend current research on cardiovascular disease analysis using ECG data.  
 Key future directions include:

 - Advanced machine learning techniques: Incorporating more machine learning methods such as deep learning could improve the accuracy and reliability of cardiovascular disease diagnosis. 
@@ -318,7 +318,7 @@ Key future directions include:

 - Integration of additional data sources: Expanding the database to include more diverse datasets from different geographic and demographic contexts could help mitigate local and demographic biases. 

- Visual Data comparsion: With the use of the given ECG data there could be calculated a "standardized" QRS-Complex for every Diagnosisgroup which were examined in this program to classify new unknown ECG data with no prior diagnosis.
+- Visual data comparison: Using the given ECG data, a "standardized" QRS complex could be calculated for every diagnosis group examined in this program, in order to classify new, unknown ECG data with no prior diagnosis.



@@ -344,7 +344,7 @@ Thank you for your interest in contributing to our project! As an open-source pr
  4. Push your changes to your branch.
  5. Submit a pull request to our repository. Include a clear description of your changes and the purpose of them.

-Please note, by contributing to this project, you agree that your contributions will be licensed under its MIT License.
+Please note that by contributing to this project, you agree that your contributions will be licensed under its MIT License.

 We look forward to your contributions. Thank you for helping us improve this project!