Exploring Heart Failure with Data

Written by Muhammad Huzaifa Khan Suri, Ayan Tabassum, Yumna Sohail and Abdul Moeed Asad as part of the course project for ‘Principles and Techniques of Data Science’ at the Lahore University of Management Sciences (LUMS)

According to the World Health Organization (WHO), more people die annually from cardiovascular diseases (CVDs) than from any other cause. In 2016, almost a third of all deaths were attributed to CVDs. The risk of cardiovascular disease increases substantially in the presence of hypertension, diabetes, or other established diseases. However, this risk can be significantly mitigated by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity. Therefore, a system that can identify at-risk populations more accurately would be a great boon to society. Using a dataset from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad (Punjab, Pakistan), available on Kaggle, we attempt to develop a model that can be used to predict heart failure and, in doing so, identify at-risk populations. We began our study by asking: can we predict heart failure based on blood quality data? In doing so, we wanted to understand:

  • What population groups are at a greater risk of heart failure?

Our dataset contains 299 medical records with 13 features covering clinical, body, and lifestyle information: six Boolean features and seven numeric features. Table 1 summarizes the 13 features with their explanations, units of measurement, and ranges. It is important to note that the patients in this dataset had left ventricular systolic dysfunction and previous heart failure that placed them in class III or IV of the New York Heart Association (NYHA) classification of the stages of heart failure. We therefore expect some features to show abnormal ranges.

Table 1: Features of the dataset. (mcg/L: micrograms per liter. mL: milliliter. mEq/L: milliequivalents per liter)

For code, please visit the Google Colab notebook.

1. Data Cleaning

As part of data cleaning, we take a look at the values of our dataset.

Figure 1-1: Using the head method to view the top 5 rows

We confirm that our dataset has the correct ranges (see table 1) and contains no null values.

Figure 1-2: Using the info method to confirm that our data has no NULL values
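
As a minimal sketch, these two checks in pandas look roughly as follows, assuming the CSV file name used on the Kaggle page:

```python
import pandas as pd

# Load the Kaggle dataset (file name assumed from the Kaggle page)
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print(df.head())  # top 5 rows, to eyeball the ranges against Table 1
df.info()         # dtypes and non-null counts, confirming no NULL values
```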

Furthermore, we note the normal ranges of the features found in our dataset. Each range is cited from a trusted source linked beside it:

  • Creatine Phosphokinase: 10–120 micrograms per liter (mcg/L) [source]

Our dataset, however, contains a redundant feature and binary values that are hard to interpret at a glance. To improve our dataset's readability, we perform the following steps:

Remove the time variable

The time variable indicates how much time has passed since the patient's last check-up visit. We remove it because it carries no significance for our further analysis.

Change the binary values in the sex variable

We change the sex variable from `1` and `0` to `M` and `F`, respectively, to improve the readability of our dataset.

Change the name of the “Creatinine Phosphokinase” column

While looking up standard ranges for blood constituents in our dataset, we noticed that one column was mislabelled: CPK stands for Creatine Phosphokinase, not Creatinine; the former is an enzyme, while the latter is a waste product formed when creatine breaks down. Hence, we rename the column before proceeding in order to avoid any confusion.

Making all columns lowercase

We make all column names lowercase so that they are consistent with one another.

Figure 1-3: Code for cleaning our data
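
A minimal sketch of these four cleaning steps, assuming the original Kaggle column names:

```python
# Remove the time variable, which we do not use in further analysis
df = df.drop(columns=["time"])

# Replace the binary sex encoding with readable labels
df["sex"] = df["sex"].map({1: "M", 0: "F"})

# Rename the mislabelled CPK column: the enzyme is creatine, not creatinine
df = df.rename(columns={"creatinine_phosphokinase": "creatine_phosphokinase"})

# Make all column names lowercase for consistency
df.columns = df.columns.str.lower()
```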

2. Exploratory Data Analysis

It’s time to visualize our data to better understand it and identify patterns. First, we look at the ratios of males to females, smokers to non-smokers, diabetic to non-diabetic patients, patients with and without high blood pressure, and patients with and without anaemia. This gives us better insight into the proportions within our dataset.

Figure 2-1: Exploration of binary data in the dataset

From the plots above, we see that our data is unbalanced: we have a higher proportion of males than females, and more patients who are non-smokers, non-diabetic, and without high blood pressure or anaemia.
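
One way to draw these plots with seaborn, as a sketch (the notebook's exact styling may differ), continuing with `df` from the cleaning step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

binary_cols = ["sex", "smoking", "diabetes", "high_blood_pressure", "anaemia"]

# One count plot per binary feature, side by side
fig, axes = plt.subplots(1, len(binary_cols), figsize=(18, 4))
for ax, col in zip(axes, binary_cols):
    sns.countplot(data=df, x=col, ax=ax)
plt.tight_layout()
plt.show()
```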

Searching for correlations

Now let’s see how these different features relate to one another. To do this, we create the following heatmap of the attributes and their pairwise correlations.
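
A sketch of how such a heatmap can be produced (numeric columns only, since sex is now categorical after cleaning):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric features
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```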

It can be observed that the correlations among the different variables are weak, which suggests there are no strong pairwise relationships in our dataset. It is also possible that an underlying relationship is simply not visible on the heatmap because of the small size of our dataset. It is interesting to note that some blood constituents and habits correlate positively with the death event and others negatively.

EXPLORING DISTRIBUTIONS IN FEATURES

Age and survival rate

Observing a violin plot of our dataset grouped into patients who died and patients who survived, we can see that the median age of those who died is higher and their distribution is more spread out towards older ages, whereas the distribution of those who survived is more concentrated in the middle age group.

Figure 2–4: Age and its relation with death event
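
A sketch of this violin plot, assuming `df` carries over from the cleaning step (so the label column is `death_event` after lowercasing):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of age, split by whether the patient died
sns.violinplot(data=df, x="death_event", y="age")
plt.xticks([0, 1], ["Survived", "Died"])
plt.show()
```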

Outliers

Next, let’s spot any potential outliers in our data.

Figure 2–5 (a) Age measured in years

We observe no outliers in the age variable. Unsurprisingly, we also see that more than 50 percent of our patients are aged between 50 and 70.

Fig 2–5 (b) Creatine phosphokinase measured in mcg/L

Interestingly, we observe numerous outliers in the levels of creatine phosphokinase. We also notice that all of these outliers lie outside the normal range (highlighted by the blue lines on the plot), yet they need not be removed: a possible explanation is that the majority of patients suffering from heart failure have elevated levels of creatine phosphokinase in their blood, since a high amount of this constituent indicates muscle weakness or breakdown.

Fig 2–5 (c) Ejection fraction measured in percentage

We have just two outliers here, and again they lie outside the normal range for ejection fraction. Since we are working with a dataset of heart failure patients, it is not very surprising that the majority of patients have ejection fractions below the normal threshold of 50.

Fig 2–5 (d) Platelets measured in kilo-platelets/mL

Most of the patients have platelet counts in the normal range highlighted on the plot; however, we still get a few outliers. After researching, we found that it is possible to have platelet counts outside the normal range, so removing these values as outliers might disrupt our predictions later.

Fig 2–5 (e) Serum Creatinine measured in mg/dL

We observe a similar trend here: more than 50 percent of patients have serum creatinine levels in the normal range, but many patients have very high levels, which may indicate some underlying condition.

Fig 2–5 (f) Serum Sodium measured in mEq/L

Here, most of our data lies within the acceptable range, although some patients have levels lower than the normal range of serum sodium, which may again indicate some underlying condition.

An important thing to note here is that we will not remove any outliers, since we are dealing with patients who are very likely to have abnormal amounts of different blood constituents. Removing them would leave us with data in the normal ranges only, which would not be helpful in later sections as we develop a prediction model.
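
The box plots above all follow the same pattern; here is a sketch for CPK, with its normal range from section 1 marked by horizontal lines:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of CPK; outliers appear as points beyond the whiskers
ax = sns.boxplot(y=df["creatine_phosphokinase"])

# Normal range (10-120 mcg/L) highlighted, as in the figures above
ax.axhline(10, color="blue", linestyle="--")
ax.axhline(120, color="blue", linestyle="--")
ax.set_ylabel("Creatine phosphokinase (mcg/L)")
plt.show()
```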

SURVIVAL RATES BY FEATURE

We would like to see how sex and the presence of different conditions and habits affected the death rate of patients in our dataset.

First, let us take a look at sex:

Figure 2–6 (a) Survival rates by sex

As seen from the pie chart, the proportion of deaths among females is almost the same as among males in our dataset. A raw count of deaths, shown in the bar plot, would not have revealed this, since our dataset is unbalanced; proportions are therefore more revealing than counts.

Figure 2–6 (b) Survival rates by smoking

We again see that, in absolute counts, more non-smokers died than smokers.

Figure 2–6 (c) Survival rates by diabetes

Looking at people with and without diabetes, we observe the same trend as with sex: more non-diabetic patients passed away in absolute terms, but the proportion of deaths with and without diabetes remains about the same.

Figure 2–6 (d) Survival rates by blood pressure

For high blood pressure, though, we observe that the proportion of deaths among patients with high blood pressure was 7.7% higher than among patients without it. This suggests a relationship between deaths and high blood pressure, although we cannot assume causality at this point.

Figure 2–6 (e) Survival rates by Anaemia

Again, analysing the plots for anaemia, we observe that the proportion of deaths among patients with anaemia was 6.6% higher than among patients without anaemia. This too suggests a relationship between anaemia and deaths.
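
These proportions are straightforward to compute with a group-by; as a sketch, for high blood pressure (the other binary features work identically):

```python
# Death proportion within each group; death_event is 0/1, so the mean
# of the column is the proportion of deaths
rates = df.groupby("high_blood_pressure")["death_event"].mean()
print(rates)

# Difference between the two groups, in percentage points
print(round((rates[1] - rates[0]) * 100, 1))
```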

FURTHER ANALYSIS OF SMOKING AND BLOOD CONSTITUENTS

Let us now look at how smoking habits and certain blood constituents affect heart failure outcomes across age groups. A common perception is that middle-aged people show the greatest resistance to heart disease, so let us see whether our dataset reaffirms this statement.

From the wider spread of the violin plot we can say that the survival rate is higher for ages 40 to 70, although the non-survival cases are spread out across all ages. We can also observe that survival is highest for males aged 50 to 60, whereas for females it is highest at ages 60 to 70.

Let us try the same analysis for survival rate of smokers with respect to age.

From this plot, we observe that survival rates at older ages (e.g. 70+) are higher for non-smokers than for smokers. Among smokers, most survivors fall in the 50 to 60 age bracket, whereas among non-smokers most survivors are aged 50 to 70.

Observing this plot, we notice that a large number of both diabetic and non-diabetic patients survived between the ages of 40 and 70. But we also see that above age 70 there is no case of survival for a diabetic patient.

Key points to take from this plot include the absence of survival cases above age 80 in the presence of high blood pressure. In the absence of high blood pressure, however, survival cases are spread across all ages, which may indicate a higher chance of survival for individuals without high blood pressure. This is something we can explore going forward.

Given the wider spread of the plot for patients who survived, we might say that many patients survived despite having anaemia. We also see that survival cases are highest for people aged 50 to 70.

DISTRIBUTION OF BLOOD FEATURES vs DEATH EVENT

Within the normal creatine phosphokinase (CPK) range, patients who survived considerably outnumber those who did not. However, as we move outside the normal range, the number of survivors drops drastically, and towards the upper extreme (≥5800) of CPK levels we find only patients who did not survive.

Within or close to the normal ejection fraction range, survivors again considerably outnumber non-survivors. Outside the normal range, the number of survivors decreases considerably, and towards the lower extreme of ejection fraction there are considerably more patients who did not survive than who did.

Within the normal platelet range, especially towards its middle, survivors considerably outnumber non-survivors. Outside the normal range, however, the counts of survivors and non-survivors are almost identical, and both are low.

Within the normal serum creatinine range, survivors significantly outnumber non-survivors. However, moving outside the normal range towards the upper extreme, the number of survivors drastically decreases until non-survivors outnumber survivors at the same serum creatinine levels. This might be indicative of a correlation; however, this conclusion is not concrete at this stage.

Within the normal serum sodium range, survivors considerably outnumber non-survivors. Outside the normal range, at both extremes, the number of survivors is either considerably lower than in the normal range or lower than the number of non-survivors at the same levels.

A general trend we observe when relating blood features to the death event is that, within normal blood constituent levels, the proportion of patients who survived is considerably higher than of those who did not. As we move outside the normal range, the proportion of survivors starts to decrease, and further out still we mostly find patients who did not survive.
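
A sketch of how these distributions can be plotted, overlaying survivors and non-survivors per blood feature:

```python
import matplotlib.pyplot as plt
import seaborn as sns

blood_features = ["creatine_phosphokinase", "ejection_fraction",
                  "platelets", "serum_creatinine", "serum_sodium"]

# One histogram per blood feature, split by death event
for col in blood_features:
    sns.histplot(data=df, x=col, hue="death_event", element="step")
    plt.show()
```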

Insights from EDA

In our EDA we explored the ways in which our dataset is unbalanced and answered some of our initial questions about heart failure. We identified that the older a person gets, the greater the risk of heart failure becomes. Surprisingly, smoking and diabetes are not heavily correlated with the death event in our dataset.

We have explored the distribution of values within features, how features relate to each other, and specifically how they relate to the death event, which we focus on because it indicates severe heart failure. We are now in a position to thoughtfully start the next phase of our study: statistical inference, modelling and prediction.

3. Statistical Inference

In this section on statistical inference, we want to confirm the health of our dataset by testing how well it corresponds to the general population. We achieve this through hypothesis testing on features that are hard to verify otherwise.

We will formulate a null and an alternative hypothesis for each feature and, with the help of bootstrapping, come to a conclusion.

We are interested in the prevalence of anaemia, smoking and diabetes in our sample and how it matches the general population.

Anaemia

According to this research article, the prevalence of Anemia for individuals classified within class III or IV of NYHA classification is 0.19.

Null: The probability that a participant within our dataset is anaemic is equivalent to the prevalence of Anemia in the general population of those classified within class III or IV of NYHA classification of heart failure.

Alternative: The probability that a participant within our dataset is anaemic is NOT equivalent to the prevalence of Anemia in the general population of those classified within class III or IV of NYHA classification of heart failure.

Now, we bootstrap our sample to simulate 10,000 values of the sample proportion and construct a confidence interval.

95% confidence interval
Lower bound: 0.27
Upper bound: 0.45

Result
Our null value of 0.19 lies outside our 95% confidence interval.
Therefore, we reject the null hypothesis. Based on our results, it is evident that the prevalence of Anemia in our dataset is not equivalent to the prevalence of Anemia in the general population.
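
A sketch of this bootstrap procedure for anaemia; the 10,000 resamples match the text, while the random seed is our own addition for reproducibility. The same code applies to diabetes and smoking by changing the column name and the null value:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed assumed, for reproducibility
sample = df["anaemia"].to_numpy()
n = len(sample)

# Resample with replacement 10,000 times; record each bootstrap proportion
boot_props = np.array([rng.choice(sample, size=n, replace=True).mean()
                       for _ in range(10_000)])

# 95% confidence interval from the bootstrap distribution
lower, upper = np.percentile(boot_props, [2.5, 97.5])
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")

# Compare the published prevalence (0.19) against the interval
null_value = 0.19
print("reject null" if not lower <= null_value <= upper else "fail to reject null")
```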

DIABETES

Based on the findings of this research article, the prevalence of Diabetes for individuals classified within class III or IV of NYHA classification is 0.304.

Null: The probability that a participant within our dataset is diabetic is equivalent to the prevalence of diabetes in the general population of those classified within class III or IV of NYHA classification of heart failure.

Alternative: The probability that a participant within our dataset is diabetic is NOT equivalent to the prevalence of diabetes in the general population of those classified within class III or IV of NYHA classification of heart failure.

Now, we bootstrap our sample to simulate 10,000 values of the sample proportion and construct a confidence interval.

95% confidence interval
Lower bound: 0.3
Upper bound: 0.49

Result
Our null value of 0.304 lies within our 95% confidence interval. Therefore, we fail to reject the null hypothesis. The prevalence of diabetes in our dataset is consistent with the prevalence of diabetes in the general population.

SMOKING

According to this research article, which relates tobacco smoking to cardiovascular disease, the prevalence of smoking among cardiovascular disease patients aged 45–64 and classified within class III or IV of the NYHA classification is 0.355.

Null: The probability that a participant within our dataset smokes is equivalent to the prevalence of smoking in the general population of those classified within class III or IV of NYHA classification of heart failure.

Alternative: The probability that a participant within our dataset smokes is NOT equivalent to the prevalence of smoking in the general population of those classified within class III or IV of NYHA classification of heart failure.

Now, we bootstrap our sample to simulate 10,000 values of the sample proportion and construct a confidence interval.

95% confidence interval
Lower bound: 0.23
Upper bound: 0.41

Result
Our null value of 0.355 lies within our 95% confidence interval. Therefore, we fail to reject the null hypothesis. Our findings are consistent with the prevalence of smoking in our sample matching that in the general population.

Insights from Statistical Inference

From our statistical analysis of the three features (Anemia, Diabetes, and Smoking), we rejected the null hypothesis only for Anemia.

Based on these results, since we fail to reject the null hypothesis for the majority of the features, we can reasonably treat our dataset as representative of the general population, and as such we will use it for further analysis.

4. Machine Learning

We will now be moving into the final stages of our analysis and developing a machine learning model to predict heart failure. In order to decide on an algorithm, we will be implementing the following algorithms on our dataset:

a) k Nearest Neighbors Classifier
b) Logistic Regression Classifier
c) Random Forest Classifier

Our dataset contains records of 299 patients, which is a relatively small number of data points for training a machine learning model, so we have decided on a 90–10 train-test split in order to develop a reasonably well-trained model.
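
As a sketch of the split: mapping sex back to 0/1 keeps all features numeric for the classifiers, while the stratification and random seed are assumptions on our part:

```python
from sklearn.model_selection import train_test_split

# Features and label; sex is mapped back to numeric for the models
X = df.drop(columns=["death_event"]).replace({"M": 1, "F": 0})
y = df["death_event"]

# 90-10 split, stratified so both sets keep the class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42)
```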

a) kNN Classifier

The kNN classifier computes distances between the test point and all the training points and assigns the most frequently occurring label among the k nearest training points to the test point.

The following is a summary of several performance metrics after running the scikit-learn kNN classifier on our data:

We see that we get 73.3% accuracy and a 0.77 F1 score using the kNN classifier.
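
A minimal sketch of this classifier; k = 5 is scikit-learn's default and an assumption on our part:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Fit on the training points and label each test point by the majority
# vote among its 5 nearest training neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(classification_report(y_test, knn.predict(X_test)))
```

Since kNN is distance-based, feature scaling (e.g. with StandardScaler) can noticeably change which neighbours are "nearest" and hence the scores.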

b) Logistic Regression Classifier

The logistic regression classifier uses the training points to compute weights that minimize the training loss, and then uses this set of weights, together with an activation function and a decision boundary, to determine whether a test point belongs to a certain class. Since the algorithm actively minimizes a loss, it often performs better than the kNN algorithm.

The following is a summary of several performance metrics after running the scikit-learn logistic regression classifier on our data. We allowed 1,000 iterations to train the weights and employed L2 regularization to prevent overfitting:

We see that we get 83.3% accuracy and a 0.84 F1 score using our logistic regression classifier.
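
A sketch with the settings described above (L2 is scikit-learn's default penalty):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Up to 1000 iterations to converge; L2 regularization against overfitting
logreg = LogisticRegression(max_iter=1000, penalty="l2")
logreg.fit(X_train, y_train)

print(classification_report(y_test, logreg.predict(X_test)))
```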

c) Random Forest Classifier

The random forest classifier is a very effective and popular algorithm that constructs multiple decision trees at training time and outputs the class that is the mode of the individual trees’ predictions at test time. Another advantage of this algorithm is that it lets us assign weights to the classes, which helps counter any issues that a data imbalance might create. Since our dataset is unbalanced, we assigned weight 1 to the ‘0’ label and weight 2 to the ‘1’ label. We also allowed 500 trees in order to compensate for the small number of data points in our dataset.

The following is a summary of several performance metrics after running the scikit-learn random forest classifier on our data:

We see that we get 90.0% accuracy and a 0.89 F1 score using the random forest classifier.
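
A sketch with the settings described above; the random seed is our own addition:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 500 trees; the minority '1' (death) label is weighted twice as heavily
rf = RandomForestClassifier(n_estimators=500,
                            class_weight={0: 1, 1: 2},
                            random_state=42)
rf.fit(X_train, y_train)

print(classification_report(y_test, rf.predict(X_test)))
```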

To get an overview of how much each attribute is contributing to our model, we constructed a visualization for feature importance.

It seems that the top 3 most important features are:

  • Serum creatinine

Interestingly enough, attributes like diabetes, anaemia, sex, high blood pressure, and smoking contribute little to our prediction.
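
A sketch of the feature importance visualization, using the impurity-based importances that the fitted forest exposes:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Mean decrease in impurity per feature, sorted for readability
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot.barh()
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```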

Lastly, the graph below compares the performance of the three algorithms. We observe that the macro-average F1 score is lowest for the kNN classifier and highest for the random forest classifier.

Conclusion

In the beginning, we asked: can we predict heart failure based on blood quality data? After extensive data cleaning, EDA, statistical inference and machine learning, we can confidently say yes. With our random forest model, we have shown that blood quality data is sufficient to predict heart failure with an accuracy of 90%.