Gazing into the Data Abyss: Employee Attrition

Shawn D'Souza
May 29, 2019

Hindsight: For the complete code and the steps taken in the data analysis, see https://www.kaggle.com/shawnd29/gazing-into-the-data-abyss-employee-attrition

One thing is for sure: there is a LOT of data out there. Ever since we began assigning values to items, we have been asking what-if questions about how to make the most of what we have.

It is no surprise that, after years of exploratory advancements with numbers, we are at the cusp of unlocking the next big thing in Data Analysis. What we can do with the data we have, and how we can optimize it to meet our objectives, seems almost magical.

With so many applications out there, I find it amazing to dive deeper into the realm of analytics and see what insights can be drawn. With that, I have chosen the topic of employee attrition and the steps we can take, once we spot the signs, to maximize employee retention.

Attrition: What makes them stay? What makes them go?

Employee retention is one of the most important metrics a company should keep in mind when thinking of growth. It matters all the more when you consider the costs of finding, interviewing and hiring the right candidate for the various jobs within the organization. Those costs hurt even more when people leave at an unexpectedly high rate and the company does not know why. Employee attrition occurs when the total strength of the company shrinks because more employees leave than expected.

This dataset comes from a fictitious scenario created by the data scientists at IBM. Although it does not reflect a real-world scenario, it is still a viable dataset to mine and build insights on. The crux of the task is to classify what type of people would leave a company based on 35 parameters. Our categorical target for this dataset is the Yes/No attrition label.

With that in mind, the steps taken to build insights from start to finish fall into FOUR major blocks:

Collecting and Cleaning the Data
Exploring the data
Transforming the data to build insights
Applying machine learning models for feature engineering

These blocks outline the general steps that can be taken when working with any dataset. For the given attrition dataset, I wanted to focus more on the graphical and visual representation aspect while keeping the machine-learning concepts on the back burner.

STEP 2: Exploring the data

Since the given dataset is split into numeric values such as Age and labeled values such as Department, it seems prudent to partition the data into quantitative and categorical values. This makes it easier to fine-tune the model for accurate results.
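A minimal sketch of that partition, assuming the data sits in a pandas DataFrame. The rows below are made up for illustration, and MonthlyIncome is an assumed column name; only Age and Department are named above.

```python
import pandas as pd

# Toy stand-in for the attrition table; every value here is invented.
df = pd.DataFrame({
    "Age": [30, 45, 28, 39],
    "MonthlyIncome": [4200, 9100, 3100, 6400],  # assumed column name
    "Department": ["Sales", "Research", "Research", "Sales"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Numeric dtypes are treated as quantitative; everything else as a label.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Quantitative:", numeric_cols)
print("Categorical:", categorical_cols)
```

Splitting by dtype is only a first pass: a numeric column that really encodes a rating or a level may still belong with the categorical group.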

One way of visually communicating how data is spread is through histograms. A histogram groups values into bins and counts how many fall into each one, which gives a general view of the data distribution.
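To make the idea concrete, here is a small sketch of the counting behind a histogram, using NumPy and a handful of made-up ages (not values from the dataset). The bin range and bin count are arbitrary choices for the example.

```python
import numpy as np

# Invented ages standing in for the Age column.
ages = np.array([22, 25, 29, 31, 34, 35, 38, 41, 45, 52, 58])

# Count how many ages fall into each of four equal-width bins from 20 to 60.
counts, edges = np.histogram(ages, bins=4, range=(20, 60))

# Print a crude text histogram: one '#' per person in the bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{int(left)}-{int(right)}: {'#' * int(count)}")
```

A plotting library draws the same counts as bars; the choice of bin width changes how smooth or jagged the picture looks.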

Another way of estimating the relation between two values is the Kernel Density Estimate (KDE) plot. It smooths the observed points into a continuous density, which makes it easier to highlight the bivariate relationship between two variables in the dataset.
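For readers curious about what a bivariate KDE computes, the following is a hand-rolled sketch with Gaussian kernels in NumPy. The data is synthetic, the bandwidths hx and hy are picked by eye, and the function name kde_2d is hypothetical; a notebook would normally rely on a plotting library for this.

```python
import numpy as np

def kde_2d(x, y, grid_x, grid_y, hx, hy):
    """Bivariate Gaussian KDE evaluated on a grid (product kernel)."""
    # Scaled distance from every grid coordinate to every observation.
    zx = (grid_x[:, None] - x[None, :]) / hx
    zy = (grid_y[:, None] - y[None, :]) / hy
    # Standard normal kernel applied along each axis.
    kx = np.exp(-0.5 * zx**2) / np.sqrt(2 * np.pi)
    ky = np.exp(-0.5 * zy**2) / np.sqrt(2 * np.pi)
    # density[i, j] is the estimate at (grid_x[i], grid_y[j]).
    return kx @ ky.T / (len(x) * hx * hy)

# Synthetic data: income loosely rising with age (an assumption for the demo).
rng = np.random.default_rng(0)
age = rng.normal(37, 9, 200)
income = 1000 + 120 * age + rng.normal(0, 800, 200)

grid_x = np.linspace(0, 80, 81)          # age axis, step 1
grid_y = np.linspace(-2000, 13000, 151)  # income axis, step 100
density = kde_2d(age, income, grid_x, grid_y, hx=3.0, hy=500.0)

# Sanity check: a density should integrate to roughly 1 over the grid.
dx = grid_x[1] - grid_x[0]
dy = grid_y[1] - grid_y[0]
total = density.sum() * dx * dy
print(f"Integral over the grid: {total:.3f}")
```

The resulting grid can be drawn as contours or a heat map. Larger bandwidths give a smoother picture, while smaller ones follow individual points more closely.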

STEP 3: Transforming the data to build insights

Data in its raw form needs to be molded before we can mine it for insights. The main drive behind this is to fine-tune the data so that we can build actionable metrics from the data as a whole. With this, both the categorical and quantitative data can be shaped.

When working with categorical values, the labels may or may not have an inherent order. Since an order cannot be assumed, one-hot encoding is used: each label becomes its own 0/1 column, which gives the categories a quantifiable form without ranking them.
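As a small illustration, pandas can one-hot encode a column with get_dummies. The department names below are placeholders chosen for the example.

```python
import pandas as pd

# Illustrative labels only; not rows from the dataset.
df = pd.DataFrame({
    "Department": ["Sales", "Human Resources", "Sales", "Research & Development"]
})

# Each distinct label becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Department"], dtype=int)
print(encoded)
```

Every row has exactly one 1 across the new columns, so no department is treated as "larger" than another.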

A correlation matrix is a quick way to see which metrics are related to each other. Although correlation does not imply causation, it is a great way to estimate the underlying behaviors within the dataset.
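A hedged sketch of how such a matrix can be computed with pandas. The numbers are invented, and YearsAtCompany and DistanceFromHome are assumed column names used only for illustration.

```python
import pandas as pd

# Made-up values; the column names other than Age are assumptions.
df = pd.DataFrame({
    "Age": [25, 32, 41, 47, 53],
    "YearsAtCompany": [1, 4, 9, 12, 20],
    "DistanceFromHome": [9, 2, 14, 5, 7],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

Values close to 1 or -1 point to a strong linear relationship, while values near 0 suggest little linear relationship. The matrix is symmetric, so only one triangle needs to be read.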

STEP 4: Feature Engineering

Now that the data has been prepped for analysis, it is time to apply machine learning methodologies to the dataset.

The following techniques were applied to the dataset to estimate which model would perform best:

1. Logistic Regression – Since the target is a binary 1/0 value, it is a natural fit.

2. Decision Tree – An easy-to-implement classification method.

3. Random Forest – An ensemble of decision trees that reduces overfitting.

4. SVM – Classifies data based on which side of a separating hyperplane it falls on.

5. KNN – Classifies data based on the labels of its nearest neighbors.

With these implemented, we get an estimate of which approach works best.
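One way to set up such a comparison, sketched with scikit-learn on synthetic data rather than the attrition table. The hyperparameters, the five-fold cross-validation and the scaling steps are assumptions for the example, not necessarily what the original notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the encoded dataset.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

# Distance- and margin-based models get a scaling step; trees do not need one.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean accuracy over five cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

The ranking printed here says nothing about the attrition data itself. Also note that when one class is much rarer than the other, accuracy alone can be misleading, so metrics such as recall or F1 are worth checking too.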

In this comparison, SVM and Random Forest take the lead. We can then break down the metrics to see which factors drive attrition the most.
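A common way to get such a breakdown is to read the feature importances of a fitted random forest. The sketch below uses synthetic data with a planted rule (low income means leaving), so the ranking it prints is true by construction and is not a finding about the real dataset. MonthlyIncome and DistanceFromHome are assumed column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic features; names and distributions are assumptions for the demo.
X = pd.DataFrame({
    "MonthlyIncome": rng.normal(6000, 2000, n),
    "Age": rng.normal(37, 9, n),
    "DistanceFromHome": rng.integers(1, 30, n),
})

# Planted rule, for illustration only: anyone earning under 5000 leaves.
y = (X["MonthlyIncome"] < 5000).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure.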

Outcomes and key takeaways

Based on the above results, how much people earn and how old they are strongly affect whether they move to a different company. In a world where we constantly strive to know our worth, it stands to reason that we will look for greener pastures if one employer does not recognize that worth. Ultimately, it falls on the company to maximize employee retention with a focus on pay. Although it might decrease net company revenue in the short run, the cost of rehiring and re-training each and every time an employee leaves will definitely leave a mark.

All in all, there are so many ways to approach the same data problem from different angles. With practice, the end goal will be to form an effective story with the data to provide actionable insights. Thank you very much for coming along with me on this journey. This was a great learning experience for me and I hope that it was for you as well.