### Cluster-analysis (version 03.07)

To enhance our understanding of the feature clusters and their similarity to the original data labels, we began by preparing the data: missing values were imputed with the column mean and the features were normalized. We then applied the K-Means algorithm for distance-based clustering and compared two sets of labels: the cluster assignments produced by K-Means and the original dataset labels. Several visualizations were generated to facilitate this comparison.

A dimensionality-reduction plot using Principal Component Analysis (PCA) revealed that, although the K-Means clusters and the groups formed by the original labels are not highly similar, both exhibit some degree of clustering. To quantitatively assess the quality of the K-Means clusters, we calculated the following metrics:

- Adjusted Rand Index (ARI): 0.15
- Normalized Mutual Information (NMI): 0.24
- Silhouette Score: 0.47

The ARI and NMI scores indicate moderate agreement between the cluster assignments and the true labels: there is some alignment, but the clustering does not capture the underlying groupings well. In other words, the distances between data points in feature space, as used by the clustering algorithm, do not fully mirror the categorization implied by the original labels.

The Silhouette Score suggests that the identified clusters are internally cohesive and well separated from each other, so the clustering algorithm does find meaningful structure in the data, even if that structure does not align perfectly with the true labels.

Further analysis included a Euclidean distance matrix plot to visualize how far apart the data points lie; it revealed a number of outliers that are markedly more distant from the rest of the data. Finally, a parallel coordinates (parallel axis) plot was generated to examine the relationship between the features and the clusters. Notably, this plot highlighted the ventricular rate as a strong separator of the original labels, in line with its importance for predicting the labels in our machine learning models.
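The preparation, clustering, and metric computation described above can be sketched with scikit-learn roughly as follows. This is an illustrative outline, not the notebook's exact code; the input file `data.csv` and the label column `label` are placeholders for the project's actual data.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Placeholder input: a table with feature columns and an original "label" column.
df = pd.read_csv("data.csv")
y_true = df["label"]
X = df.drop(columns=["label"])

# Mean imputation of missing values, followed by normalization of the features.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
X_scaled = StandardScaler().fit_transform(X_imputed)

# Distance-based clustering with K-Means; k is set to the number of original labels.
kmeans = KMeans(n_clusters=y_true.nunique(), n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

# ARI and NMI compare the cluster assignments with the original labels;
# the silhouette score measures internal cohesion and separation only.
print("ARI:", adjusted_rand_score(y_true, y_kmeans))
print("NMI:", normalized_mutual_info_score(y_true, y_kmeans))
print("Silhouette:", silhouette_score(X_scaled, y_kmeans))
```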
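The PCA comparison plot can be reproduced along these lines; the snippet reuses `X_scaled`, `y_kmeans`, and `y_true` from the sketch above and is likewise only illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

# Project the scaled data onto the first two principal components.
coords = PCA(n_components=2).fit_transform(X_scaled)

# Same projection, coloured once by K-Means clusters and once by the original labels.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(coords[:, 0], coords[:, 1], c=y_kmeans, cmap="tab10", s=10)
axes[0].set_title("K-Means clusters")
axes[1].scatter(coords[:, 0], coords[:, 1], c=pd.factorize(y_true)[0], cmap="tab10", s=10)
axes[1].set_title("Original labels")
for ax in axes:
    ax.set_xlabel("PC 1")
axes[0].set_ylabel("PC 2")
plt.tight_layout()
plt.show()
```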
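The distance-matrix and parallel-axis views can be sketched as follows, again reusing the variables from the first snippet; the plotting choices (colormap, figure layout) are arbitrary and the column names are those of the placeholder input.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import pairwise_distances

# Pairwise Euclidean distances between data points; unusually bright
# rows/columns indicate outliers that lie far from the rest of the data.
dist = pairwise_distances(X_scaled, metric="euclidean")
plt.imshow(dist, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise Euclidean distance matrix")
plt.show()

# Parallel-coordinates (parallel axis) plot of the scaled features per cluster,
# used to see which features separate the groups.
plot_df = pd.DataFrame(X_scaled, columns=X.columns)
plot_df["cluster"] = y_kmeans
pd.plotting.parallel_coordinates(plot_df, class_column="cluster", colormap="tab10", alpha=0.3)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
```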
The detailed procedures can be found in the following notebook:
[cluster_features.ipynb](notebooks/cluster_features.ipynb)

## Legal basis (version 03.07)