klara 2024-06-21 16:30:57 +02:00
commit 34dd8ca284
3 changed files with 105 additions and 55 deletions


@ -145,6 +145,8 @@ The colour scale represents the correlation between the two types of categorizat
- Blue (low)
- Red (high)
Notably, the 60-70 age group in SB shows a noticeably higher correlation and will therefore be considered in our hypothesis analysis. The other groups align largely as expected, with a slight increase in correlation observed with age.
The exact procedure for creating the matrix can be found in the notebook [demographic_plots.ipynb](notebooks/demographic_plots.ipynb).
![Correlation matrix](readme_data/Korrelationsmatrix.png)
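The snippet below is a purely illustrative sketch of how such a heatmap can be produced with seaborn; it is not the notebook's actual code, and the correlation values, age bins, and diagnostic-group labels (other than SB) are invented placeholders.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Invented correlation values between assumed age bins and diagnostic groups
corr = pd.DataFrame(
    rng.uniform(-1, 1, size=(5, 4)),
    index=["20-40", "40-50", "50-60", "60-70", "70+"],  # assumed age bins
    columns=["SB", "SR", "AFIB", "GSVT"],               # assumed diagnostic groups
)

# Blue marks low correlation, red marks high, matching the colour scale above
sns.heatmap(corr, cmap="coolwarm", annot=True, vmin=-1, vmax=1)
plt.title("Correlation between age groups and diagnostic groups")
plt.tight_layout()
plt.show()
```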
@ -192,17 +194,32 @@ With those Classifiers, the hypothesis can be proven, that a classifier is able
The sample size in the study conducted may also play a role in the significance of the frequency.
### Noise reduction
Noise suppression was performed on the existing ECG data. A three-stage noise reduction was applied to the ECG signals: first, a Butterworth filter to remove high-frequency noise; then a Loess filter to remove low-frequency noise; and finally a non-local-means filter to remove the remaining noise. For processing all data, however, the built-in NeuroKit2 noise reduction function `ecg_clean` was used, due to time-performance considerations.
How the noise reduction was performed in detail can be seen in the following notebook: [noise_reduction.ipynb](notebooks/noise_reduction.ipynb)
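As a minimal sketch of this `ecg_clean` step (not the notebook's exact code; the simulated signal and the 500 Hz sampling rate are assumptions):

```python
import neurokit2 as nk

# Placeholder signal standing in for one lead of the real ECG data
ecg_raw = nk.ecg_simulate(duration=10, sampling_rate=500)

# Built-in NeuroKit2 noise reduction, as used for the full dataset
# ("neurokit" is the library's default cleaning method)
ecg_cleaned = nk.ecg_clean(ecg_raw, sampling_rate=500, method="neurokit")
```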
### Features
The feature-detection capability of the NeuroKit2 library is tested on the ECG dataset. These features are detected using NeuroKit2 and are important for training the model to distinguish the different diagnostic groups.
For the training, the features considered are:
- ventricular rate
- atrial rate
- T axis
- R axis
- Q peak amplitude
- QT length
- QRS duration
- QRS count
- gender
- age
The selection of features was informed by an analysis presented in a paper (source: https://rdcu.be/dH2jI, last accessed: 15.05.2024), where various feature sets were evaluated. These features were chosen for their optimal balance between performance and significance.
The exact process can be found in the notebook: [features_detection.ipynb](notebooks/features_detection.ipynb).
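As a rough sketch of how a few of the listed features could be derived with NeuroKit2 (not the notebook's exact code; the simulated signal, the 500 Hz sampling rate, and the `peak` delineation method are assumptions, and gender and age come from the dataset metadata rather than from the signal):

```python
import neurokit2 as nk
import numpy as np

sampling_rate = 500  # assumed sampling rate
ecg = nk.ecg_simulate(duration=10, sampling_rate=sampling_rate)  # placeholder signal

cleaned = nk.ecg_clean(ecg, sampling_rate=sampling_rate)
_, rpeaks = nk.ecg_peaks(cleaned, sampling_rate=sampling_rate)
_, waves = nk.ecg_delineate(cleaned, rpeaks, sampling_rate=sampling_rate, method="peak")

# Ventricular rate: mean heart rate derived from the R-R intervals
rr_seconds = np.diff(rpeaks["ECG_R_Peaks"]) / sampling_rate
ventricular_rate = 60 / np.mean(rr_seconds)

# QRS count: number of detected R peaks in the recording
qrs_count = len(rpeaks["ECG_R_Peaks"])

# Q-peak amplitude: mean signal value at the delineated Q peaks
q_peaks = [int(p) for p in waves["ECG_Q_Peaks"] if not np.isnan(p)]
q_peak_amplitude = float(np.mean(cleaned[q_peaks]))

print(ventricular_rate, qrs_count, q_peak_amplitude)
```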
### ML-models
For machine learning, the features were first tailored to the models, and a grid search was then used to identify the best hyperparameters. The best performance was achieved by the Extreme Gradient Boosting (XGBoost) model with an accuracy of 83%; a Gradient Boosting Tree model evaluated with the same procedure reached an accuracy of 82%. The selection of these models was informed by the team's own experience and by the performance metrics highlighted in the paper (source: https://rdcu.be/dH2jI, last accessed: 15.05.2024). The evaluation of the models also shows that some features, such as the ventricular rate, are more important than others.
<br>The detailed procedures can be found in the following notebooks:
<br>[ml_xgboost.ipynb](notebooks/ml_xgboost.ipynb)
<br>[ml_grad_boost_tree.ipynb](notebooks/ml_grad_boost_tree.ipynb)
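A minimal sketch of the grid-search step described above (not the notebooks' exact code; the synthetic data stands in for the extracted ECG feature table, and the parameter-grid values are assumptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic placeholder for the real feature table (ventricular rate, atrial rate, ...)
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "n_estimators": [100, 200],
}

grid = GridSearchCV(
    xgb.XGBClassifier(eval_metric="merror"),  # multiclass error, as in the training log
    param_grid,
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)

preds = grid.best_estimator_.predict(X_test)
print("Best parameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, preds))
print("Macro F1:", f1_score(y_test, preds, average="macro"))
```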

File diff suppressed because one or more lines are too long


@ -9,7 +9,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@ -20,10 +20,9 @@
"import matplotlib.pyplot as plt\n",
"import xgboost as xgb\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.metrics import confusion_matrix, f1_score\n",
"import seaborn as sns\n",
"import numpy as np"
]
},
{
@ -35,7 +34,7 @@
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@ -59,7 +58,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
@ -337,7 +336,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
@ -362,14 +361,14 @@
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[16:58:49] WARNING: C:/Users/administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627: \n",
"Parameters: { \"best_iteration\", \"best_ntree_limit\", \"scikit_learn\" } might not be used.\n",
"\n",
" This could be a false alarm, with some parameters getting used by language bindings but\n",
@ -377,13 +376,7 @@
" but getting flagged wrongly here. Please open an issue if you find any such cases.\n",
"\n",
"\n",
"[0]\ttrain-merror:0.16762\teval-merror:0.22603\n",
"[1]\ttrain-merror:0.15220\teval-merror:0.22374\n",
"[2]\ttrain-merror:0.13849\teval-merror:0.21461\n",
"[3]\ttrain-merror:0.13535\teval-merror:0.20776\n",
@ -483,8 +476,8 @@
"[97]\ttrain-merror:0.00029\teval-merror:0.18265\n",
"[98]\ttrain-merror:0.00029\teval-merror:0.18265\n",
"[99]\ttrain-merror:0.00029\teval-merror:0.18265\n",
"CPU times: total: 14.3 s\n",
"Wall time: 1.22 s\n"
]
}
],
@ -506,7 +499,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
@ -546,7 +539,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
@ -577,7 +570,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
@ -609,7 +602,7 @@
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
@ -618,7 +611,7 @@
"<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
@ -640,7 +633,7 @@
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
@ -679,6 +672,26 @@
"plt.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"F1 Score: 0.8157211953487169\n"
]
}
],
"source": [
"# Calculate F1 Score for multiclass classification\n",
"f1 = f1_score(test_y, preds, average='macro')\n",
"\n",
"print('F1 Score:', f1)"
]
}
],
"metadata": {