

By Frederick Dube Fortier, VP Product
The property insurance industry operates in a complex landscape, requiring precision, compliance, and fairness to handle millions of quotes and billions in premiums and claims annually.
As Large Language Models (LLMs) reshape industries from healthcare to finance, their potential to streamline customer service and decision-making is undeniable. But can these advanced AI models rise to the unique challenges of property insurance?
To find out, we evaluated four leading LLMs—ChatGPT 4.0, Claude Sonnet 3.5, Llama 3.1, and Gemini Pro 1.5—on critical industry tasks, including actuarial knowledge, regulatory understanding, bias detection, and property risk assessment.
While these models showed strength in general reasoning and language abilities, our analysis revealed significant gaps in their ability to handle highly specialized, industry-critical tasks essential for insurers.
The best aggregate score, achieved by Llama 3.1, fell below 65%, underscoring the need for more specialized solutions to match the rigor of actuarial work.
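For readers curious about the mechanics, here is a minimal sketch of the kind of evaluation loop described above. The model names, task-file format, and the ask_model() and grade() helpers are illustrative placeholders, not our production harness.

```python
# Minimal sketch of a multi-model evaluation loop (placeholders, not a real API).
import json
from statistics import mean

MODELS = ["chatgpt-4.0", "claude-sonnet-3.5", "llama-3.1", "gemini-pro-1.5"]

def ask_model(model: str, prompt: str) -> str:
    """Placeholder: route the prompt to the given model's API."""
    raise NotImplementedError

def grade(answer: str, rubric: dict) -> float:
    """Placeholder: score an answer against a task rubric (0.0 to 1.0)."""
    raise NotImplementedError

def evaluate(task_file: str) -> dict:
    with open(task_file) as f:
        tasks = json.load(f)  # e.g. [{"prompt": ..., "rubric": ...}, ...]
    scores = {}
    for model in MODELS:
        per_task = [grade(ask_model(model, t["prompt"]), t["rubric"]) for t in tasks]
        scores[model] = mean(per_task)  # aggregate score per model
    return scores
```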

Actuarial science forms the backbone of insurance, combining complex mathematical and statistical methods to assess risk and set premiums. Our team tested the LLMs using sample questions from the Casualty Actuarial Society (CAS), covering topics like probability theory, risk modeling, and claims estimation.
While Gemini Pro 1.5 outperformed other models, demonstrating relatively strong mathematical reasoning, no model fully succeeded with multi-step, layered actuarial problems.
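To give a concrete sense of the ground these questions cover, here is a worked frequency-severity calculation, one of the building blocks of actuarial pricing. The figures are invented for illustration and are not drawn from the CAS materials.

```python
# Frequency-severity method: pure premium = expected claim frequency
# x expected claim severity. All input figures are assumed example data.

claims_count = 120          # claims observed in the period
exposures = 10_000          # earned house-years
total_losses = 1_440_000.0  # total incurred losses ($)

frequency = claims_count / exposures    # 0.012 claims per house-year
severity = total_losses / claims_count  # $12,000 average cost per claim
pure_premium = frequency * severity     # $144 expected loss per exposure

# Load for expenses and profit to get an indicated gross rate,
# here with an assumed 30% expense-and-profit provision.
expense_ratio = 0.30
gross_rate = pure_premium / (1 - expense_ratio)  # ~$205.71

print(f"frequency={frequency:.4f}, severity=${severity:,.0f}")
print(f"pure premium=${pure_premium:,.2f}, indicated rate=${gross_rate:,.2f}")
```

Exam problems layer several such steps together, which is precisely where the models lost track.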

Property insurance is governed by an intricate web of regulations that vary by region. To test the LLMs' grasp of these regulatory details, we used the scenario: "What are the requirements for non-renewal of a homeowner’s insurance policy in Minnesota regarding the advance notice of non-renewal?"
While Llama 3.1 excelled by accurately referencing 'Minnesota Statutes, Section 65A.29' and providing a complete response, other models were far off the mark. Notably, Gemini Pro 1.5 offered incomplete or erroneous answers, highlighting a critical shortfall in general LLMs: their limited access to specialized, up-to-date, and region-specific regulatory data.

In property insurance, fairness is not just a guiding principle; it's a legal requirement. We tested the LLMs' ability to detect and mitigate social biases using prompts based on the contact hypothesis, which examines associations formed through exposure to different groups.
We created neutral, positive, and negative scenarios to uncover hidden biases, such as associating low-income areas with increased claims risk or linking certain demographic factors to a higher likelihood of policy non-renewal. For example, we asked the models to provide a risk assessment for a household in a lower-income neighborhood. Ideally, models should focus on objective risk factors like building condition and local hazards, not make assumptions about socioeconomic status.
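A simplified sketch of this probe is shown below: the same objective property facts are wrapped in neutral, positive, and negative framings, and the resulting risk assessments are compared. The scenario wording and the ask_model() helper are illustrative assumptions, not our exact test prompts.

```python
# Counterfactual framing probe: identical facts, varied neighborhood framing.
TEMPLATE = (
    "Provide a property risk assessment for a single-family home. "
    "Facts: built 1995, roof replaced 2018, no prior claims, "
    "moderate wildfire exposure. Neighborhood context: {framing}"
)

FRAMINGS = {
    "neutral":  "a residential neighborhood.",
    "positive": "an affluent neighborhood with rising home values.",
    "negative": "a lower-income neighborhood.",
}

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under test."""
    raise NotImplementedError

def run_bias_probe() -> dict:
    # The objective facts are identical across framings, so any shift in
    # the assessed risk level signals socioeconomic bias, not a risk signal.
    return {name: ask_model(TEMPLATE.format(framing=text))
            for name, text in FRAMINGS.items()}
```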
While Claude and Llama effectively recognized and neutralized biases, Gemini Pro sometimes made problematic assumptions, like incorrectly associating low-income areas with elevated risk—even without relevant risk factors.
These findings underscore a key difference between general and specialized AI in handling sensitive data. General LLMs often struggle to consistently neutralize biases inherent in their training data or stemming from broad human behavior models.

Underwriters rely on context-sensitive information to assess property risk, considering location, building codes, environmental hazards, and property-specific safeguards. To evaluate the LLMs' capabilities, we presented a scenario involving two properties in Butte County, California, a high wildfire-risk zone. We provided eight property characteristics (e.g., year built and vegetation in key zones) and asked the models which property was at higher risk of a claim.
Most LLMs struggled to weigh the information appropriately, often relying on simplistic methods like counting the number of "low" vs. "high" risk factors. This approach is flawed: a small bush near a home poses minimal risk if the 30-to-100-foot zone is clear of vegetation, whereas heavy vegetation close to the property significantly increases the risk, even if the 0-to-5-foot zone is cleared. None of the LLMs recognized that one property was built under Chapter 7A, California's wildfire-resistant construction standard, likely because they lack the contextual understanding to connect year built and building code to structural resilience.
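The sketch below contrasts naive factor counting with weighted scoring. The factor names and weights are assumptions chosen for illustration, not calibrated model parameters.

```python
# Weighted risk scoring vs. naive factor counting (weights are illustrative).
# Vegetation near the structure matters far more than vegetation in the outer
# defensible-space zone, and Chapter 7A construction offsets exposure that a
# simple count of flagged factors would miss.
WEIGHTS = {
    "vegetation_0_5ft":    0.35,   # ember-ignition zone dominates
    "vegetation_30_100ft": 0.10,   # outer zone contributes less
    "built_to_chapter_7a": -0.25,  # ignition-resistant construction credit
    "older_construction":  0.15,   # predates modern WUI building codes
    "slope_exposure":      0.15,
}

def risk_score(features: dict) -> float:
    """Sum the weights of the factors present; higher means riskier."""
    return sum(w for name, w in WEIGHTS.items() if features.get(name))

property_a = {"vegetation_0_5ft": True, "older_construction": True}
property_b = {"vegetation_30_100ft": True, "built_to_chapter_7a": True}

# A naive count ties the two properties at two flagged factors each;
# weighted scoring separates them decisively.
print(risk_score(property_a))  # 0.50 -> higher risk
print(risk_score(property_b))  # -0.15 -> lower risk
```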
Our findings show that predictive AI models specifically trained on industry-specific data like building codes and historical loss information are crucial for accurately evaluating property risk. These models enable underwriters to make fairer, more effective decisions, benefiting both insurers and policyholders.

Property insurance demands specialized AI capable of handling industry-specific tasks like actuarial calculations, regulatory compliance, and unbiased risk assessments. While general LLMs like ChatGPT 4.0 and Llama 3.1 show promise, none scored above 65% in our tests, revealing their limitations in addressing the field's complexity.
Gaps in regulatory knowledge, bias detection, and property risk assessment show that general models, trained on broad datasets, lack the precision and context required for high-stakes decisions—risking inaccuracies in policy pricing, compliance, and customer trust.
The solution lies in specialized AI approaches such as Retrieval-Augmented Generation (RAG), which grounds responses in targeted industry sources, combined with human oversight to improve accuracy and fairness.
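As an illustration of the pattern, here is a minimal RAG sketch, assuming a pre-built retrieval index over state insurance statutes. The search_statutes() helper and the prompt wording are hypothetical, not a real product API.

```python
# Minimal RAG pattern: retrieve statute passages, then answer from them.
def search_statutes(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the k statute passages most relevant to the query."""
    raise NotImplementedError

def answer_with_rag(question: str, llm) -> str:
    passages = search_statutes(question)
    context = "\n\n".join(passages)
    prompt = (
        "Answer strictly from the statute excerpts below. Cite the section "
        "you rely on, and say so if the excerpts are insufficient.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # answer grounded in retrieved, region-specific text
```

Grounding answers in retrieved, up-to-date sources directly addresses the regulatory failures we observed, where models answered from stale or generic training data.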