The insurance industry possesses huge amounts of data. Many analyses are run to gain better insight into fraud, risks and the value of portfolios. But how reliable is all this information, and how can it be made (even) more reliable?
As a data scientist one usually deals with large amounts of information: data from clients, data from external sources and, of course, internal or proprietary data. In the case of insurance, the collected information on insured persons and objects, claims, and detected fraud helps in making well-founded judgements about risks, trends, and the value of policies and portfolios.
The ideal world could be captured completely in figures and data fields. But how reliable is all this information? There are substantial pitfalls between that dream and reality, both in the systems and with us humans. Differences in culture, accuracy and consistency make it difficult to compare the contents of administrative systems. And to top it off, the human factor can influence data quality both positively and negatively.
There is plenty of room for improvement. Below I list my top three pitfalls, then describe my ideal world and discuss three steps that may help us approximate it.
The lack of international uniformity in how we record information
In the Netherlands a vehicle is identified by its number plate; in Belgium, by its chassis number. Insurers in these two countries therefore register different data, making it difficult to determine whether specific records concern one and the same vehicle. Addresses, dates of birth and family names are also an ongoing source of potential confusion and misunderstanding between countries.
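As a minimal sketch of the matching problem, consider two records that can only be compared on the one identifier both registries happen to hold, the chassis number (VIN). The field names and normalization rules below are illustrative assumptions, not an actual insurer's schema:

```python
# Sketch: matching vehicle records across two national registries that
# use different primary identifiers. Field names are hypothetical.

def normalize_vin(vin: str) -> str:
    """Uppercase and strip separators; VINs never contain I, O or Q,
    which are easily confused with 1 and 0 in manual entry."""
    vin = vin.upper().replace(" ", "").replace("-", "")
    return vin.replace("I", "1").replace("O", "0").replace("Q", "0")

def same_vehicle(record_nl: dict, record_be: dict) -> bool:
    """Dutch records key on the number plate, Belgian ones on the
    chassis number; only the chassis number is comparable across both."""
    vin_nl = record_nl.get("chassis_number")
    vin_be = record_be.get("chassis_number")
    if not vin_nl or not vin_be:
        return False  # no shared identifier -> cannot decide
    return normalize_vin(vin_nl) == normalize_vin(vin_be)

nl = {"number_plate": "XX-123-Y", "chassis_number": "wvwzzz1jz3w386752"}
be = {"chassis_number": "WVW-ZZZ1J Z3W386752"}
print(same_vehicle(nl, be))  # True: the normalized chassis numbers agree
```

Note that when the Dutch record lacks a chassis number entirely, no amount of normalization helps: the match is simply undecidable, which is exactly the pitfall described above.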
Changing the manner in which registration systems are used
Over the years, decisions may have been made to record certain information in a more specific way. For example, at first only a category ‘theft’ may have been recorded, with ‘car theft’ added at a later stage. Or the field ‘project number’ was repurposed for entering number plates. Often the knowledge of what was changed and why is still available within the organization, but little of it is recorded and documented. Outsiders then lack the background of certain data, which leads to outcomes that are difficult to explain.
The influence of bias or prejudice
The data we receive on investigated claims is often based on investigations that were not carried out at random. There was a reason, a feeling, an indication why certain claims were investigated. That feeling might be valid, but there is a danger that one has merely searched for justification of a preconception. If one stopped every fancy car with a young driver, one would undoubtedly find irregularities. But who is to say that checking every white car would not lead to the same outcome? If that kind of data is incorporated in models, one risks reinforcing the bias.
The ideal world
In the ideal world of data analysis, we would all work with uniform data, nationally and internationally. We would all use the same definitions, and there would be no linguistic or cultural differences. We would also all use the same type of database, with the same fields and the same manner of entering information. In that ideal world, I could perform an independent, random analysis on a subset of all claims and test to what extent our prejudices influence our knowledge rules.
Fortunately, more and more techniques are being developed to detect unwanted biases. Unfortunately, reaching that ideal world is probably a utopia. The following steps, however, may bring us a little closer. That will not only facilitate the work of the data scientist; it will also improve the quality of analyses and thus the managerial decisions based on them.
Three steps forward
What can we improve?
Choose software systems for the long term
Also, choose a supplier that knows insurance and will preferably still be operating in twenty years' time. Set up a system carefully and, as much as possible, use it for the purpose it was intended for. Document changes accurately. Do not switch systems too quickly, and should a switch be necessary, spend sufficient time and energy on migrating the data. At all times avoid having to keep two systems running in parallel: better one system with imperfect data than two systems with perfect but inconsistent data.
Invest nationally, but preferably also internationally, in more uniformity of the data
Fraud networks cause extensive damage to insurers precisely because they ignore national borders and differences in registration; in fact, fraud rings thrive on those borders. Fighting them would become far more effective if registering chassis numbers and social security numbers were standard practice, improving the identification of people and vehicles. At this moment, national rules and legislation do not always allow for such standards, but even small steps bring improvements: every international fraud network that is dismantled directly leads to huge savings in claim payouts.
Transformation: converting the data to a standard format for analysis
All data should be converted to one standard format. This is the ideal moment for us specialists to sit down with the client and discuss: what is the current system like, what is its history, which fields are custom, and how should certain information be interpreted? It might seem as if we only work with hard figures and definitions, but there are many potential misunderstandings and cultural differences.
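A small sketch of what such a transformation step can look like. The system names, field names and code table below are hypothetical; in practice, every mapping is agreed with the client in exactly the conversation described above:

```python
# Sketch: converting claim records from two legacy systems into one
# standard format. All schemas here are illustrative assumptions.
from datetime import date, datetime

# Legacy category codes mapped onto one shared vocabulary
CATEGORY_MAP = {
    "theft": "theft",                  # old generic code, kept as-is
    "car theft": "vehicle_theft",      # later, more specific code
    "DIEFSTAL-AUTO": "vehicle_theft",  # other system, Dutch-language label
}

def from_system_a(rec: dict) -> dict:
    """System A stores dates as DD-MM-YYYY and number plates in
    'project_number' (a repurposed field, as discussed above)."""
    return {
        "claim_date": datetime.strptime(rec["date"], "%d-%m-%Y").date(),
        "category": CATEGORY_MAP[rec["category"]],
        "number_plate": rec["project_number"],
    }

def from_system_b(rec: dict) -> dict:
    """System B already uses ISO dates but its own category labels."""
    return {
        "claim_date": date.fromisoformat(rec["datum"]),
        "category": CATEGORY_MAP[rec["categorie"]],
        "number_plate": rec.get("nummerplaat"),
    }

a = from_system_a({"date": "03-05-2021", "category": "car theft",
                   "project_number": "XX-123-Y"})
b = from_system_b({"datum": "2021-05-03", "categorie": "DIEFSTAL-AUTO",
                   "nummerplaat": "XX-123-Y"})
print(a == b)  # True: both systems now describe the same event identically
```

The explicit mapping table is the point: it forces the historical quirks of each system (repurposed fields, generic versus specific codes) out of people's heads and into documented, reviewable code.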
Regardless of the quality of the data, a critical human eye will always be indispensable when dealing with analyses and their outcomes. Unexpected results are interesting, but may have various causes. It is up to us to filter out the bias, the impurities and the misunderstandings, and to provide reliable, clear analyses and conclusions. Insurers can use these to improve their products, customer satisfaction and operations management. In this manner we keep the industry healthy.