Predicting Life Expectancy of Countries Using Regression Model

Simran Mayra
6 min read · Jan 18, 2021

Which factors are most heavily weighed when predicting a region’s life expectancy?

The World Health Organization (WHO) released a health data set on life expectancy whose factors fall into five categories: health, social, economic, mortality, and immunization. The data covers 193 countries over the period 2000–2015 and was made public for health analysis. The final data set is quite large, with 22 columns and 2,938 rows, including 20 predictor variables sorted into those categories.

Building My Own Regression Model

Past research on the subject was based on a data set covering just one year for all of the countries. I wanted to build a regression model, based on a mixed effects model and linear regression, using data from 2000–2015 for all the countries.

Here’s what I did:

  1. Created a data visualization that pairs any one of the 20 variables (Hep B, Income Composition of Resources, Measles, Alcohol, BMI, etc.) with life expectancy as the fixed variable, to see the correlation through a polynomial and a linear regression
  2. Used Pearson’s correlation with the polynomial regression (PLR) and linear regression (LR) to predict life expectancy for a particular country, based on values inputted for 2 factors that carry a high weight

Some significant predicting factors that affect life expectancy are the income composition of resources, Hepatitis B, schooling, alcohol, BMI, diphtheria, polio, percentage expenditure, total expenditure, and GDP.

To find these core factors, you can use principal component analysis (PCA). PCA helps reveal trends, jumps, clusters, and outliers by representing a multivariable data table as a smaller set of variables. In other words, it projects data from a high-dimensional space onto a lower-dimensional one while keeping as much of the variation as possible.
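To make the PCA idea concrete, here is a minimal sketch that finds the first principal component with power iteration. It uses only the Python standard library so it runs anywhere; it is not the code behind my model, and the rows are made-up illustrative values (Hep B coverage %, income composition of resources, years of schooling), not figures from the WHO data set.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_ratio) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Standardize each column (zero mean, unit variance) so units do not dominate.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Covariance matrix of the standardized data (this is the correlation matrix).
    cov = [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; the trace is the total variance.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance

# Hypothetical rows: (Hep B coverage %, income composition, schooling years).
rows = [
    (60, 0.45, 8.0),
    (75, 0.55, 10.0),
    (85, 0.70, 12.5),
    (92, 0.80, 14.0),
    (97, 0.90, 16.0),
]
direction, explained = first_principal_component(rows)
print("first component:", direction)
print("share of variance explained:", explained)
```

The entries of the direction vector show how strongly each variable loads on the component, which is one rough way to see which factors move together. In practice a library such as scikit-learn would be the more usual choice.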

Furthermore, income composition of resources and Hepatitis B immunization seem to play the most substantial role in the average length of someone’s life, which is why I used these as the default variables.

Hepatitis B & Life Expectancy Correlation

As expected, the increase in immunization coverage against viruses such as Hepatitis B, Polio, Diphtheria, and Measles has resulted in higher life expectancy amongst developed countries in comparison to developing ones.

Right now, we’re going to focus on Hep B specifically. This virus is a considerable health problem globally, causing chronic infection and increasing the risk of death from liver cancer and cirrhosis. The virus is often transmitted from mother to child during birth, and also through contact with infected bodily fluids, including sex with an infected partner and injection drug use where needles, syringes, or other equipment are shared. In highly endemic areas such as the Western Pacific Region and the African Region, chronic infection is actually quite common in children infected before the age of 5.

In infants and children:

  • 80–90% of infants infected during the first year of life develop chronic infections; and
  • 30–50% of children infected before the age of 6 years develop chronic conditions.

Worldwide coverage of the HBV birth dose is 43%, and in the African Region it is a mere 6%. This is one of the many reasons why this region has a lower average life span. Low immunization coverage really does play a role in decreasing a country’s overall life expectancy.

Now that we realize the significance of a country having high immunization rates with these transmittable viruses, we can, for example, aim to support countries in achieving the global hepatitis elimination targets under the Sustainable Development Agenda 2030. The World Health Organization has outlined the steps to attain these targets. Some main action items include:

  • Raising awareness about HBV
  • Mobilizing resources
  • Increasing health equities within the hepatitis response
  • Scaling screening processes for HBV
  • Improving care/treatment service in developing countries

The value for Hepatitis B that can be inputted in the regression model is the percentage of immunization coverage among 1-year-olds in a specific country.

Income Composition of Resources & Life Expectancy Correlation

The income composition of resources is a Human Development Index between 0 and 1 and is calculated based on income and resource availability. Furthermore, the composition of the total income of a country refers to the share of each income source, which is then expressed as a percentage of the nation’s total income.

Regression analysis is described as “Using the relationship between variables to find the best fit line or the regression equation that can be used to make predictions.”

The regression analysis of the income composition of resources & life expectancy is a form of predictive modelling whose purpose is to investigate the relationship between a dependent and independent variable. In this case, the income composition of resources is the independent variable, and life expectancy is the dependent variable.

Here is Germany’s income composition of resources correlation with life expectancy.

Here is Canada’s Income composition of resources correlation with life expectancy.

Both plots use linear and polynomial regressions to visualize the data and the correlation between the independent and dependent variables.
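For readers who want to see what a linear fit and a polynomial fit look like in code, here is a rough sketch using plain least squares and only the standard library. The helper names (polyfit, predict, sse) are my own for this illustration, and the data points are made-up values for a hypothetical country, not the Germany or Canada figures plotted above.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns [c0, c1, ...] for c0 + c1*x + ..."""
    m = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            for c in range(col, m):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs

def predict(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def sse(coeffs, xs, ys):
    """Sum of squared errors of the fit on the given points."""
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: income composition of resources vs. life expectancy.
income = [0.50, 0.60, 0.70, 0.80, 0.90]
life = [62.0, 66.5, 71.0, 76.0, 80.0]

linear = polyfit(income, life, 1)      # straight line
quadratic = polyfit(income, life, 2)   # degree-2 polynomial

print("linear SSE:   ", sse(linear, income, life))
print("quadratic SSE:", sse(quadratic, income, life))
```

The quadratic fit can never have a larger squared error than the linear one on the same points, so a lower error by itself does not mean the curve is the better model. With real data, a library routine such as NumPy's polyfit would normally replace the hand-written solver.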

Predicting Life Expectancy

Using Pearson’s correlation together with the regression, we can predict life expectancy. My model takes input values for only 2 variables, which is why I chose the ones that carry the most weight in predicting life expectancy. These variables can be changed, and more can be added.

We know that correlation is used to investigate the relationship between quantitative, continuous variables. In this case, we have Hepatitis B immunization coverage (%), Income composition of resources (0–1 index), and life expectancy prediction (value of age).

One thing to know about Pearson’s correlation coefficient is how sensitive it is to outliers. Outliers have a significant effect on both the prediction and the line of best fit, and including them leads to misleading results. For the same reason, I removed all values in the data set that were infinite or missing, which keeps invalid entries from distorting the analysis.
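The cleaning step can be sketched like this. It is a minimal standard-library version, assuming each row is a tuple of numbers; the function name and the sample rows are hypothetical, and with pandas the same idea is usually expressed with replace and dropna.

```python
import math

def drop_incomplete_rows(rows):
    """Keep only rows where every value is a finite number (no None, NaN, or inf)."""
    def is_valid(value):
        return isinstance(value, (int, float)) and math.isfinite(value)
    return [row for row in rows if all(is_valid(v) for v in row)]

# Hypothetical rows: (Hep B coverage %, income composition, life expectancy).
raw_rows = [
    (83.0, 0.71, 72.4),
    (None, 0.65, 70.1),          # missing value
    (91.0, float("nan"), 74.0),  # NaN
    (float("inf"), 0.80, 78.2),  # infinite value
    (95.0, 0.88, 80.5),
]
clean_rows = drop_incomplete_rows(raw_rows)
print(len(raw_rows), "rows before,", len(clean_rows), "rows after")
```

Note that this removes invalid entries rather than statistical outliers; detecting genuine outliers would need a separate rule, such as a threshold on distance from the mean.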

In addition to this, Pearson’s correlation coefficient (r) measures the strength of the association between the variables. You can see that the correlation for LR is approximately 0.993 in the photo below. The prediction for Cambodia’s life expectancy is close to 67 years of age.
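Here is a small sketch of how r is computed and how it ties into a linear prediction. It is a simplification under stated assumptions: it uses a single predictor, while my model takes two inputs, and the numbers are made up for illustration rather than taken from Cambodia or any other country in the data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares line via r: slope = r * (sd_y / sd_x), through the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    slope = pearson_r(xs, ys) * sd_y / sd_x
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: Hep B coverage (%) vs. life expectancy (years).
hep_b = [60, 75, 85, 92, 97]
life = [61.0, 66.0, 70.5, 73.0, 75.5]

r = pearson_r(hep_b, life)
slope, intercept = fit_line(hep_b, life)
predicted = slope * 90 + intercept  # prediction for a coverage of 90%

print("r =", r)
print("predicted life expectancy at 90% coverage:", predicted)
```

A value of r close to 1 means the points sit close to a rising straight line, so the linear prediction tracks the data well; it says nothing on its own about cause and effect.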

Why is it valuable to predict life expectancy using such models?

Look at it this way: countless factors play into a country’s average life expectancy and how long somebody may live. Making predictions using machine learning regressions gives us more insight into immunization and social factors we don’t often think about. By understanding correlations between multiple variables, we learn about the factors that affect lifespan the most.

Using this model I created, one thing we can do is predict a country’s life expectancy as infrastructure, society, and resources change over the years. We can also experiment with certain factors and their correlation to a high life expectancy.

Most importantly though, we can use the model and data set to find areas of improvement in specific countries. By knowing which factors play the most significant role in a lower life expectancy, a country can decide to spend more money and resources on those areas. Without a doubt, it’s crucial for nations to know where they are lacking and can do better across the health, social, economic, mortality, and immunization categories. The more knowledge we have about the importance of these factors, the more success we will have when it comes to extending a country’s average life expectancy and creating a better quality of life for those who live there.

Thanks for reading my article! My code is posted on GitHub if you want to take a look. Also, be sure to follow me on Medium if you enjoyed this article. Take care everyone & have a great rest of the week :)