# Hypothesis
This notebook reads the data from the pickle files and tests the hypothesis that sinus bradycardia occurs significantly more often in the 60-70 age group than in the other age groups.
The chi-squared test is used for this purpose.

In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pickle
from scipy.stats import chi2_contingency
from data_helper import *


In [6]:
#path = "C:/Studium/dsa/data"
#path = "C:/Users/Nils/Documents/HS-Mannheim/0000_MASTER/DSA/EKG_Prog/data"
path = "C:/Users/klara/projects/DSA/data"

categories_dict = {
'SB':    [426177001],
'AFIB':  [164889003, 164890007],
'GSVT':  [426761007, 713422000, 233896004, 233897008, 713422000],
'SR':    [426783006, 427393009]
}

data = {}
for cat_name in categories_dict.keys():
    print(f"Reading {cat_name}")
    with open(f'{path}/{cat_name}.pkl', 'rb') as f:
        records = pickle.load(f)
        data[cat_name] = records
        print(f"Length of {cat_name}: {len(records)}")

data_demographic = {'age':[], 'diag':[], 'gender':[]}
for cat_name, records in data.items():
    for record in records:
        age = record.comments[0].split(' ')[1]
        sex = record.comments[1].split(' ')[1]
        if age == 'NaN' or sex == 'NaN':
            continue
        # The label (e.g. 'Age:') was already stripped by the split above; store the age as an integer
        data_demographic['age'].append(int(age))
        data_demographic['diag'].append(cat_name)
        data_demographic['gender'].append(sex)

df_dgc = pd.DataFrame(data_demographic)

# Bin the ages into 10-year age groups
age_categories = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
df_dgc['age_group'] = pd.cut(df_dgc['age'], bins=age_categories)
corr_matrix_age_diag = pd.crosstab(df_dgc['age_group'], df_dgc['diag'])

# Chi-square test
chi2, p, _, _ = chi2_contingency(corr_matrix_age_diag)

# Report the overall test of independence between age group and diagnosis
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")

# Check whether SB (sinus bradycardia) is significantly more frequent in the 60-70 age group
sb_60_70 = corr_matrix_age_diag.loc[pd.Interval(60, 70, closed='right'), 'SB']
sb_other = corr_matrix_age_diag.drop(pd.Interval(60, 70, closed='right')).sum()['SB']
total_60_70 = corr_matrix_age_diag.loc[pd.Interval(60, 70, closed='right')].sum()
total_other = corr_matrix_age_diag.drop(pd.Interval(60, 70, closed='right')).sum().sum()

# 2x2 contingency table: rows = 60-70 vs. other age groups, columns = SB vs. other diagnoses
observed = [[sb_60_70, total_60_70 - sb_60_70], [sb_other, total_other - sb_other]]
chi2_sb, p_sb = chi2_contingency(observed)[:2]


print(f"Chi-Square Statistic for SB in 60-70 vs others: {chi2_sb}")
print(f"P-value for SB in 60-70 vs others: {p_sb}")

Reading SB
Length of SB: 50
Reading AFIB
Length of AFIB: 27
Reading GSVT
Length of GSVT: 0
Reading SR
Length of SR: 13
Chi-Square Statistic: 38.266574797751275
P-value: 0.0004730210823940083
Chi-Square Statistic for SB in 60-70 vs others: 1.4858035714285718
P-value for SB in 60-70 vs others: 0.22286870264719977


The results can be interpreted as follows:

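- The first value is the chi-square statistic, which measures how far the observed frequencies deviate from the frequencies that would be expected if age group and diagnosis were independent. A larger value indicates a larger deviation. The p-value is the probability of obtaining a deviation at least this large if there were in fact no association. If the p-value is below the significance level of 0.05, the deviation is considered statistically significant. Here the p-value is about 0.0005, so the distribution of diagnoses differs significantly across the age groups.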

- The second chi-square statistic compares the frequency of sinus bradycardia in the 60-70 age group with its frequency in all other age groups combined. If the p-value is below the significance level of 0.05, the difference in frequency is significant. Here the p-value is about 0.22, which is above 0.05, so the data do not support the hypothesis that sinus bradycardia is significantly more frequent in the 60-70 age group.
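
To make the interpretation above more concrete, the following minimal sketch recomputes a chi-square statistic by hand for a 2x2 table and compares it with `chi2_contingency`. The counts are hypothetical and chosen only for illustration; they are not the notebook's data. Note that SciPy applies Yates' continuity correction to 2x2 tables by default, so the sketch passes `correction=False` to match the uncorrected manual formula. Because the test itself is non-directional, the sketch also compares the two SB shares to see in which direction a difference points.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts, for illustration only (not taken from the notebook's data).
# Rows: age group 60-70, all other age groups. Columns: SB, not SB.
observed = np.array([[12, 8],
                     [30, 40]])

# Expected counts under independence: row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic without continuity correction.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False disables Yates' continuity correction, which SciPy applies
# by default to 2x2 tables, so the result matches the manual calculation.
chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed, correction=False)

# The test is non-directional, so compare the SB shares to see the direction.
share_60_70 = observed[0, 0] / observed[0].sum()
share_other = observed[1, 0] / observed[1].sum()

print(f"chi2 (manual): {chi2_manual:.4f}, chi2 (SciPy): {chi2_scipy:.4f}, p: {p_value:.4f}")
print(f"SB share 60-70: {share_60_70:.2f}, others: {share_other:.2f}")
print(f"Smallest expected count: {expected_scipy.min():.2f}")
```

The smallest expected count is printed because the chi-square approximation becomes unreliable when expected counts are small, which is worth checking for a data set of this size.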