<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://shellyleee.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://shellyleee.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-27T09:57:40+00:00</updated><id>https://shellyleee.github.io/feed.xml</id><title type="html">blank</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">When Do Customers Leave? A Survival Analysis of Telco Churn Behavior</title><link href="https://shellyleee.github.io/blog/2026/survival_analysis/" rel="alternate" type="text/html" title="When Do Customers Leave? A Survival Analysis of Telco Churn Behavior"/><published>2026-04-25T21:01:00+00:00</published><updated>2026-04-25T21:01:00+00:00</updated><id>https://shellyleee.github.io/blog/2026/survival_analysis</id><content type="html" xml:base="https://shellyleee.github.io/blog/2026/survival_analysis/"><![CDATA[<div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/survival_analysis-480.webp 480w,/assets/img/survival_analysis/survival_analysis-800.webp 800w,/assets/img/survival_analysis/survival_analysis-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/survival_analysis.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="survival_analysis" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <h2 id="1-introduction">1. Introduction</h2> <p>Survival analysis is a statistical framework used to study the time until an event occurs. 
In business analytics, it is especially useful when the outcome is not only whether a customer leaves, but also how long the customer remains active before leaving.</p> <p>In this project, the survival analysis task was applied to the <a href="https://github.com/IBM/telco-customer-churn-on-icp4d/blob/master/data/Telco-Customer-Churn.csv">IBM Telco Customer Churn dataset</a>. The survival setting was defined as follows:</p> <ul> <li><strong>Event</strong>: customer churn</li> <li><strong>Duration</strong>: customer tenure, measured in months</li> <li><strong>Censoring</strong>: customers who had not churned by the time of observation</li> </ul> <p>Under this definition, customers with <code class="language-plaintext highlighter-rouge">Churn = Yes</code> are treated as observed events, while customers with <code class="language-plaintext highlighter-rouge">Churn = No</code> are treated as right-censored observations. This allows the analysis to reflect both customers who left and customers who were still retained at the end of the recorded period.</p> <hr/> <h2 id="2-methodology">2. Methodology</h2> <h3 id="21-data-preprocessing">2.1 Data preprocessing</h3> <p>The dataset was loaded from the CSV file <code class="language-plaintext highlighter-rouge">Telco-Customer-Churn.csv</code> into a standard PySpark environment using an explicit schema. 
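</p>

<p>As a rough illustration, the row-level cleaning rules described below can be sketched in plain Python (the real pipeline used PySpark with an explicit schema; the helper name and example row here are hypothetical):</p>

```python
# Sketch of the cleaning rules applied to each customer record:
# - map Yes/No churn labels to a binary event indicator
# - treat blank TotalCharges (tenure = 0 customers) as 0.0 instead of dropping
def clean_row(row):
    cleaned = dict(row)
    cleaned["churn"] = 1 if row["Churn"] == "Yes" else 0
    raw_total = row["TotalCharges"].strip()
    cleaned["TotalCharges"] = float(raw_total) if raw_total else 0.0
    return cleaned

print(clean_row({"Churn": "No", "tenure": 0, "TotalCharges": " "}))
```

<p>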
This avoided reliance on automatic type inference and made the workflow reproducible outside Databricks.</p> <p>Several preprocessing steps were performed before the survival analysis:</p> <ul> <li>The original churn field was converted from <code class="language-plaintext highlighter-rouge">Yes/No</code> into a binary event variable: <ul> <li><code class="language-plaintext highlighter-rouge">churn = 1</code> for customers who left</li> <li><code class="language-plaintext highlighter-rouge">churn = 0</code> for customers who remained</li> </ul> </li> <li>The <code class="language-plaintext highlighter-rouge">tenure</code> field was used as the survival duration and interpreted as the number of months a customer had stayed with the company.</li> <li>The <code class="language-plaintext highlighter-rouge">TotalCharges</code> field was cleaned carefully because the dataset contains blank values. These blank entries correspond to customers with <code class="language-plaintext highlighter-rouge">tenure = 0</code>, so they were handled consistently during cleaning rather than dropped blindly.</li> <li>A curated analysis table was then created for descriptive statistics, survival tables, and group comparisons.</li> </ul> <h3 id="22-survival-analysis-approach">2.2 Survival analysis approach</h3> <p>The analysis used a Kaplan-Meier style survival estimation implemented in PySpark with a discrete-time monthly approximation. 
Since <code class="language-plaintext highlighter-rouge">tenure</code> is recorded in whole months, survival probabilities were computed month by month.</p> <p>For each tenure month, the analysis calculated:</p> <ul> <li>the number of customers at risk at the beginning of the month</li> <li>the number of churn events observed in that month</li> <li>the number of censored observations in that month</li> <li>the corresponding conditional hazard</li> <li>the cumulative survival probability</li> </ul> <p>The Kaplan-Meier estimator was then approximated as:</p> \[S(t) = \prod_{i \le t}\left(1 - \frac{d_i}{n_i}\right)\] <p>where:</p> <ul> <li>$d_i$ is the number of churn events at month $i$</li> <li>$n_i$ is the number of customers at risk at the start of month $i$</li> </ul> <p>To understand heterogeneity in customer retention, separate survival curves and subgroup summaries were examined for meaningful customer characteristics, including:</p> <ul> <li><code class="language-plaintext highlighter-rouge">Contract</code></li> <li><code class="language-plaintext highlighter-rouge">SeniorCitizen</code></li> <li><code class="language-plaintext highlighter-rouge">TechSupport</code></li> </ul> <p>These variables were selected because they are business-relevant and plausibly associated with customer stability and churn risk.</p> <hr/> <h2 id="3-validation">3. Validation</h2> <p>The results were reviewed and verified iteratively.
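</p>

<p>As one concrete example of such a manual check, the early-month Kaplan-Meier values can be recomputed in plain Python from the risk-set and event counts reported in Table 2 below:</p>

```python
# Recompute S(t) = prod_{i <= t} (1 - d_i / n_i) for the first months,
# using the (month, at_risk, events) counts from Table 2.
rows = [(0, 7043, 0), (1, 7032, 380), (2, 6419, 123),
        (3, 6181, 94), (4, 5981, 83)]

survival, s = {}, 1.0
for month, at_risk, events in rows:
    s *= 1.0 - events / at_risk   # conditional hazard d_i / n_i
    survival[month] = round(s, 3)

print(survival)  # matches the reported values, e.g. S(1) ≈ 0.946, S(4) ≈ 0.901
```

<p>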
Rather than accepting the first implementation mechanically, the outputs were examined carefully for internal consistency and business plausibility.</p> <p>The validation process included the following checks:</p> <ul> <li>survival tables were compared against the raw churn data</li> <li>early-month Kaplan-Meier calculations were checked manually for consistency</li> <li>final survival probabilities were cross-checked across exported result tables</li> <li>subgroup results were reviewed to confirm that the direction of effects matched reasonable expectations</li> </ul> <p>The overall churn counts, monthly risk sets, event counts, and survival probabilities were all found to be consistent with the underlying dataset.</p> <hr/> <h2 id="4-results">4. Results</h2> <h3 id="41-churn-distribution">4.1 Churn distribution</h3> <p>The curated dataset contained <strong>7,043</strong> customers in total.</p> <p>The churn distribution was:</p> <ul> <li><strong>No churn / censored</strong>: 5,174 customers (<strong>73.46%</strong>)</li> <li><strong>Churn event</strong>: 1,869 customers (<strong>26.54%</strong>)</li> </ul> <p>This indicates that most customers were retained at the observation endpoint, but a substantial minority had churned.</p> <p>Table 1 summarizes the observed churn distribution in the curated dataset.</p> <div class="table-responsive"> <table class="table table-hover table-sm"> <thead> <tr> <th>Churn status</th> <th class="text-end">Customers</th> <th class="text-end">Percentage</th> </tr> </thead> <tbody> <tr> <td>No churn / censored</td> <td class="text-end">5,174</td> <td class="text-end">73.46%</td> </tr> <tr> <td>Churn event</td> <td class="text-end">1,869</td> <td class="text-end">26.54%</td> </tr> </tbody> </table> </div> <div class="caption"> Table 1. 
Churn distribution in the curated Telco customer dataset </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/churn_distribution-480.webp 480w,/assets/img/survival_analysis/churn_distribution-800.webp 800w,/assets/img/survival_analysis/churn_distribution-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/churn_distribution.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="churn_distribution" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 1. Churn distribution bar chart </div> <h3 id="42-overall-kaplan-meier-survival-behavior">4.2 Overall Kaplan-Meier survival behavior</h3> <p>The overall Kaplan-Meier curve showed a clear decline in survival probability over time, with the steepest drop occurring in the earliest months of customer tenure.</p> <p>The first few months illustrate this pattern clearly:</p> <ul> <li>At month 0, survival began at 1.000</li> <li>At month 1, survival dropped to about <strong>0.946</strong></li> <li>At month 2, survival dropped further to about <strong>0.928</strong></li> <li>At month 4, survival was about <strong>0.901</strong></li> </ul> <p>By the end of the observation window, the estimated survival probability at month 72 was approximately <strong>0.593</strong>.</p> <p>This pattern suggests that churn risk is highest early in the customer lifecycle, after which the remaining customer base becomes relatively more stable.</p> <p>Table 2 shows a representative excerpt from the monthly survival table used to compute the Kaplan-Meier curve.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Tenure month</th> <th class="text-end">At risk</th> <th class="text-end">Events</th> <th 
class="text-end">Censored</th> <th class="text-end">Hazard</th> </tr> </thead> <tbody> <tr><td>0</td><td class="text-end">7,043</td><td class="text-end">0</td><td class="text-end">11</td><td class="text-end">0.0000</td></tr> <tr><td>1</td><td class="text-end">7,032</td><td class="text-end">380</td><td class="text-end">233</td><td class="text-end">0.0540</td></tr> <tr><td>2</td><td class="text-end">6,419</td><td class="text-end">123</td><td class="text-end">115</td><td class="text-end">0.0192</td></tr> <tr><td>3</td><td class="text-end">6,181</td><td class="text-end">94</td><td class="text-end">106</td><td class="text-end">0.0152</td></tr> <tr><td>4</td><td class="text-end">5,981</td><td class="text-end">83</td><td class="text-end">93</td><td class="text-end">0.0139</td></tr> <tr><td>5</td><td class="text-end">5,805</td><td class="text-end">64</td><td class="text-end">69</td><td class="text-end">0.0110</td></tr> <tr><td>6</td><td class="text-end">5,672</td><td class="text-end">40</td><td class="text-end">70</td><td class="text-end">0.0071</td></tr> <tr><td>7</td><td class="text-end">5,562</td><td class="text-end">51</td><td class="text-end">80</td><td class="text-end">0.0092</td></tr> <tr><td>8</td><td class="text-end">5,431</td><td class="text-end">42</td><td class="text-end">81</td><td class="text-end">0.0077</td></tr> <tr><td>9</td><td class="text-end">5,308</td><td class="text-end">46</td><td class="text-end">73</td><td class="text-end">0.0087</td></tr> </tbody> </table> </div> <div class="caption"> Table 2. 
Representative rows from the overall monthly survival table </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/km_curve-480.webp 480w,/assets/img/survival_analysis/km_curve-800.webp 800w,/assets/img/survival_analysis/km_curve-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/km_curve.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="km_curve" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 2. Overall Kaplan-Meier survival curve </div> <h3 id="43-descriptive-differences-by-churn-status">4.3 Descriptive differences by churn status</h3> <p>Customers who churned had systematically shorter tenure and higher monthly charges than those who stayed:</p> <ul> <li><strong>Non-churned customers</strong> <ul> <li>mean tenure: <strong>37.57 months</strong></li> <li>mean monthly charges: <strong>61.27</strong></li> <li>mean total charges: <strong>2549.91</strong></li> </ul> </li> <li><strong>Churned customers</strong> <ul> <li>mean tenure: <strong>17.98 months</strong></li> <li>mean monthly charges: <strong>74.44</strong></li> <li>mean total charges: <strong>1531.80</strong></li> </ul> </li> </ul> <p>These differences are consistent with the survival analysis results. 
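</p>

<p>The group summaries in Table 3 amount to per-group means and standard deviations; a toy pure-Python sketch of that computation (the records below are made up, not the real data, which was aggregated in PySpark over all 7,043 customers):</p>

```python
from statistics import mean, stdev

# Hypothetical mini-sample of customers: (churn_flag, tenure, monthly_charges)
customers = [(0, 60, 50.0), (0, 30, 70.0), (0, 45, 55.0),
             (1, 5, 85.0), (1, 20, 75.0)]

for flag, label in [(0, "no churn"), (1, "churn")]:
    tenures = [t for c, t, _ in customers if c == flag]
    charges = [m for c, _, m in customers if c == flag]
    # mean/SD tenure and mean monthly charges for this group
    print(label, round(mean(tenures), 2), round(stdev(tenures), 2),
          round(mean(charges), 2))
```

<p>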
Customers who leave tend to do so earlier, and they also tend to pay higher monthly rates.</p> <p>Table 3 summarizes the main descriptive differences between churned and non-churned customers.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Churn group</th> <th class="text-end">Customers</th> <th class="text-end">Mean tenure (months)</th> <th class="text-end">SD tenure</th> <th class="text-end">Mean monthly charges</th> <th class="text-end">SD monthly charges</th> <th class="text-end">Mean total charges</th> <th class="text-end">SD total charges</th> </tr> </thead> <tbody> <tr> <td>No churn / censored</td> <td class="text-end">5,174</td> <td class="text-end">37.57</td> <td class="text-end">24.11</td> <td class="text-end">61.27</td> <td class="text-end">31.09</td> <td class="text-end">2549.91</td> <td class="text-end">2329.95</td> </tr> <tr> <td>Churn event</td> <td class="text-end">1,869</td> <td class="text-end">17.98</td> <td class="text-end">19.53</td> <td class="text-end">74.44</td> <td class="text-end">24.67</td> <td class="text-end">1531.80</td> <td class="text-end">1890.82</td> </tr> </tbody> </table> </div> <div class="caption"> Table 3. Descriptive statistics by churn status </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/tenure_distribution_by_churn-480.webp 480w,/assets/img/survival_analysis/tenure_distribution_by_churn-800.webp 800w,/assets/img/survival_analysis/tenure_distribution_by_churn-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/tenure_distribution_by_churn.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="tenure_distribution_by_churn" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 3. 
Tenure distribution by churn status </div> <h3 id="44-group-comparison-contract-type">4.4 Group comparison: Contract type</h3> <p>Contract type showed the strongest and clearest survival separation.</p> <p>The estimated final survival probabilities were approximately:</p> <ul> <li><strong>Month-to-month</strong>: <strong>0.129</strong></li> <li><strong>One year</strong>: <strong>0.568</strong></li> <li><strong>Two year</strong>: <strong>0.936</strong></li> </ul> <p>This pattern is highly interpretable. Customers on month-to-month plans experience much lower retention over time, while two-year contract customers are substantially more stable. The month-to-month curve declines sharply in the early periods, indicating strong early churn risk.</p> <p>Table 4 reports the final Kaplan-Meier survival probabilities by contract type.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Contract type</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>Month-to-month</td> <td class="text-end">72</td> <td class="text-end">0.1290</td> </tr> <tr> <td>One year</td> <td class="text-end">72</td> <td class="text-end">0.5681</td> </tr> <tr> <td>Two year</td> <td class="text-end">72</td> <td class="text-end">0.9357</td> </tr> </tbody> </table> </div> <div class="caption"> Table 4. 
Final survival probabilities by contract type </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/contract_survival-480.webp 480w,/assets/img/survival_analysis/contract_survival-800.webp 800w,/assets/img/survival_analysis/contract_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/contract_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="contract_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 4. Survival curves grouped by contract type </div> <h3 id="45-group-comparison-seniorcitizen">4.5 Group comparison: SeniorCitizen</h3> <p>Senior status also showed a meaningful difference.</p> <p>The estimated final survival probabilities were approximately:</p> <ul> <li><strong>Non-senior customers</strong>: <strong>0.634</strong></li> <li><strong>Senior customers</strong>: <strong>0.421</strong></li> </ul> <p>This suggests that senior customers in this dataset experienced higher churn risk and lower long-term survival than non-senior customers.</p> <p>Table 5 reports the final Kaplan-Meier survival probabilities by senior citizen status.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Senior citizen group</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>Non-senior customers</td> <td class="text-end">72</td> <td class="text-end">0.6339</td> </tr> <tr> <td>Senior customers</td> <td class="text-end">72</td> <td class="text-end">0.4213</td> </tr> </tbody> </table> </div> <div class="caption"> Table 5. Final Kaplan–Meier survival probabilities by senior citizen status. 
</div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/senior_survival-480.webp 480w,/assets/img/survival_analysis/senior_survival-800.webp 800w,/assets/img/survival_analysis/senior_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/senior_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="senior_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 5. Survival curves grouped by senior citizen status </div> <h3 id="46-group-comparison-techsupport">4.6 Group comparison: TechSupport</h3> <p>Where subgroup summaries were available, <code class="language-plaintext highlighter-rouge">TechSupport</code> also appeared to be associated with retention differences.</p> <p>The observed pattern was logically consistent:</p> <ul> <li>customers with <strong>no tech support</strong> had the weakest retention</li> <li>customers with <strong>tech support</strong> showed better retention</li> <li>customers with <strong>no internet service</strong> appeared most stable</li> </ul> <p>This pattern is plausible from a business perspective. Customers lacking technical support may be more vulnerable to dissatisfaction or service friction, increasing churn risk. 
By contrast, customers with support or with simpler service configurations tend to be more stable.</p> <p>Table 6 reports the final Kaplan-Meier survival probabilities by <code class="language-plaintext highlighter-rouge">TechSupport</code> status.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>TechSupport group</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>No</td> <td class="text-end">72</td> <td class="text-end">0.3492</td> </tr> <tr> <td>No internet service</td> <td class="text-end">72</td> <td class="text-end">0.9015</td> </tr> <tr> <td>Yes</td> <td class="text-end">72</td> <td class="text-end">0.7608</td> </tr> </tbody> </table> </div> <div class="caption"> Table 6. Final Kaplan–Meier survival probabilities by TechSupport status. </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/techsupport_survival-480.webp 480w,/assets/img/survival_analysis/techsupport_survival-800.webp 800w,/assets/img/survival_analysis/techsupport_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/techsupport_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="techsupport_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 6. Kaplan-Meier survival curves grouped by TechSupport status </div> <p>At month 72, customers without tech support had the lowest estimated survival probability at <strong>0.3492</strong>, while customers with tech support reached <strong>0.7608</strong>. 
The highest survival probability, <strong>0.9015</strong>, was observed for customers with no internet service.</p> <p>These output values are consistent with the plotted curves and suggest that access to technical support is associated with stronger customer retention in this dataset.</p> <hr/> <h2 id="5-insights">5. Insights</h2> <p>Several factors appear to influence churn risk strongly.</p> <p>First, <strong>contract type</strong> is the dominant factor in the survival analysis. Month-to-month customers have much lower survival than customers with one-year or two-year contracts. This suggests that commitment structure is closely tied to retention.</p> <p>Second, <strong>customer age category</strong>, represented here by <code class="language-plaintext highlighter-rouge">SeniorCitizen</code>, is associated with lower survival among senior customers. While this does not explain why the difference exists, it signals that this segment may require more targeted retention support.</p> <p>Third, <strong>service support conditions</strong> matter. Customers without technical support appear more vulnerable to churn, which may reflect unresolved service issues, weaker engagement, or lower perceived value.</p> <p>From a business perspective, the results imply that churn risk is concentrated among customers with:</p> <ul> <li>short tenure</li> <li>flexible month-to-month contracts</li> <li>weaker service support conditions</li> <li>higher monthly charges</li> </ul> <p>These findings suggest several practical retention strategies:</p> <ul> <li>prioritize onboarding and retention efforts in the first few months</li> <li>design incentives for migration from month-to-month to longer contracts</li> <li>expand proactive support for at-risk service groups</li> <li>monitor high-bill customers for dissatisfaction signals</li> </ul> <hr/> <h2 id="6-limitations">6. 
Limitations</h2> <p>This analysis has several important limitations.</p> <p>First, the Kaplan-Meier procedure was implemented as a <strong>discrete monthly approximation</strong> because tenure is recorded in whole months. This is appropriate for the dataset, but it is still a grouped version of survival analysis rather than a fully continuous-time formulation.</p> <p>Second, the analysis did <strong>not include formal statistical hypothesis testing</strong>, such as a log-rank test. As a result, the subgroup differences can be described and visualized, but not formally tested for statistical significance within this notebook.</p> <p>Third, the dataset is <strong>observational</strong>. The results identify associations between customer characteristics and retention patterns, but they should not be interpreted as causal effects.</p> <p>Finally, some subgroup conclusions depend on descriptive survival comparisons rather than multivariable modeling. This means that confounding between factors may still be present.</p> <hr/> <h2 id="7-conclusion">7. Conclusion</h2> <p>This survival analysis shows that the IBM Telco Customer Churn dataset contains strong and interpretable retention patterns when churn is treated as the event and tenure is treated as the survival duration.</p> <p>The results indicate that churn risk is highest early in the customer lifecycle and that survival differs substantially across customer groups. In particular:</p> <ul> <li>month-to-month contracts are associated with the lowest survival</li> <li>two-year contracts are associated with the strongest retention</li> <li>senior customers show lower survival than non-senior customers</li> <li>technical support status is also related to customer stability</li> </ul> <p>Overall, the Kaplan-Meier analysis provides a clear and credible view of customer retention dynamics. 
Even without a parametric survival model, the descriptive survival framework offers useful evidence for understanding churn timing, identifying high-risk groups, and informing retention strategy.</p>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[Explore customer churn through a survival analysis perspective using the IBM Telco dataset. By treating churn as a time-to-event problem, we uncover how contract types, technical support, and customer characteristics influence retention over time.]]></summary></entry><entry><title type="html">Uncertainty Estimation Methods in Large Language Models - A Taxonomy</title><link href="https://shellyleee.github.io/blog/2025/UQ/" rel="alternate" type="text/html" title="Uncertainty Estimation Methods in Large Language Models - A Taxonomy"/><published>2025-09-19T21:01:00+00:00</published><updated>2025-09-19T21:01:00+00:00</updated><id>https://shellyleee.github.io/blog/2025/UQ</id><content type="html" xml:base="https://shellyleee.github.io/blog/2025/UQ/"><![CDATA[<h3 id="introduction">Introduction</h3> <p>Reliability and trustworthiness are paramount challenges in the deployment of Large Language Models (LLMs). A critical component of reliable AI is <strong>Uncertainty Estimation</strong>, which aims to quantify the model’s confidence in its own generations. 
This post provides a systematic taxonomy of current uncertainty estimation methods, synthesizing key literature including recent surveys such as <a href="https://arxiv.org/abs/2412.05563">A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions</a> and <a href="https://arxiv.org/abs/2503.00172">A Survey of Uncertainty Estimation Methods on Large Language Models</a>.</p> <p>The methods are categorized into five distinct approaches: Token Probability, Verbalized Confidence, Consistency/Ensemble-based, Structural/Graph-based, and Hidden State Probing.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/UQ-480.webp 480w,/assets/img/UQ-800.webp 800w,/assets/img/UQ-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/UQ.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="UQ image" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 1: Five distinct approaches - Token Probability, Verbalized Confidence, Consistency/Ensemble-based, Structural/Graph-based, and Hidden State Probing. </div> <hr/> <h3 id="1-logit-based-and-probability-derived-methods">1. Logit-Based and Probability-Derived Methods</h3> <p>This category utilizes the internal probability distribution of the model (white-box access) to derive uncertainty scores.
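</p>

<p>A minimal sketch of the probability-derived scores this category builds on, using made-up per-token probabilities rather than real model logits:</p>

```python
import math

# Hypothetical per-token probabilities for one generated answer.
token_probs = [0.91, 0.85, 0.60, 0.95]

# Sequence-level scores: average probability, and average negative
# log-likelihood (a length-normalized variant of the joint log-prob).
avg_prob = sum(token_probs) / len(token_probs)
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Entropy of a single (toy) next-token distribution: higher = more uncertain.
dist = [0.7, 0.2, 0.1]
entropy = -sum(p * math.log(p) for p in dist)

print(round(avg_prob, 3), round(avg_nll, 3), round(entropy, 3))
```

<p>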
These methods rely on the output logits of tokens or the entire sequence.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Metrics:</strong> Calculation of statistics such as Average/Max Probability or Average/Max Entropy across the generated token sequence.</li> <li><strong>Validation:</strong> Content correctness is often validated using similarity metrics (ExactMatch, BLEU, RougeL, Jaccard Index, BERTScore, Cosine Similarity) where a threshold (e.g., score &gt; 0.5) implies correctness.</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>Requires white-box access to the model (access to logits).</li> <li>Suffers from miscalibration (models are often confident but wrong).</li> </ul> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/latent-480.webp 480w,/assets/img/latent-800.webp 800w,/assets/img/latent-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/latent.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="latent" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 2: Illustration of latent information methods. (Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2012.00955">How Can We Know When Language Models Know?
On the Calibration of Language Models for Question Answering</a> <ul> <li><em>Method:</em> Uses confidence scores derived from the average/max negative log-likelihood probability and the entropy of response tokens.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2207.05221">Language Models (Mostly) Know What They Know</a> <ul> <li><em>Method:</em> Evaluates the probability of the statement being true, denoted as $P(\text{true})$.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2002.07650">Uncertainty Estimation in Autoregressive Structured Prediction</a> <ul> <li><em>Method:</em> Introduces Length-Normalized (LN) methods to mitigate the bias where longer sequences yield lower joint probabilities.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2402.11756">MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs</a> <ul> <li><em>Method:</em> Improves upon standard length normalization by assigning weights to each token via BERT embeddings to capture semantic importance.</li> </ul> </li> </ul> <hr/> <h3 id="2-verbalized-confidence-prompt-based">2. 
Verbalized Confidence (Prompt-Based)</h3> <p>These methods treat the LLM as a black box, leveraging prompt engineering to explicitly query the model for its confidence level or to generate reasoning paths (Chain-of-Thought) regarding its certainty.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Direct Query:</strong> Prompting the model to output a numerical score or a linguistic confidence marker along with the answer.</li> <li><strong>Framework:</strong> Often involves multi-stage prompting (Answer generation $\rightarrow$ Confidence elicitation).</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>Performance is heavily dependent on prompt design.</li> <li>It is challenging to improve the separation between correct and incorrect predictions solely through prompting without fine-tuning.</li> </ul> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/verbalize-480.webp 480w,/assets/img/verbalize-800.webp 800w,/assets/img/verbalize-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/verbalize.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="verbalize" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 3: Illustration of verbalized confidence methods.
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2205.14334">Teaching Models to Express Their Uncertainty in Words</a> <ul> <li><em>Method:</em> GPT-3 is prompted to generate both the answer and a verbalized confidence level simultaneously.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2305.14975">Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback</a> <ul> <li><em>Method:</em> Explores various strategies including Label Probability, “IsTrue” Probability, and Verbalized methods (1-step vs. 2-step, Top-K, CoT).</li> </ul> </li> <li><a href="https://arxiv.org/abs/2306.13063">Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs</a> <ul> <li><em>Method:</em> Proposes a 3-stage framework for black-box LLMs: (1) Prompting for confidence, (2) Sampling diverse responses, (3) Aggregation (ranking/averaging).</li> </ul> </li> <li><a href="https://openreview.net/forum?id=Yd2S8flZKm">Quantifying Uncertainty in Natural Language Explanations of Large Language Models</a> <ul> <li><em>Method:</em> Measures uncertainty in reasoning steps. Uses <em>Token Importance Scoring</em> (via sample probing) and <em>Step-wise CoT Confidence</em> (via model probing).</li> </ul> </li> </ul> <hr/> <h3 id="3-consistency-and-ensemble-based-methods">3. Consistency and Ensemble-Based Methods</h3> <p>This approach is based on the intuition that if a model is confident, multiple sampled generations should be consistent.
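</p> <p>As an illustrative sketch of this intuition (not the exact scoring rule of any specific paper below), one can sample several generations and take the average pairwise lexical overlap as a confidence score; here plain word-level Jaccard similarity stands in for whatever similarity measure a given method uses:</p>

```python
import itertools

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two responses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def consistency_confidence(samples: list[str]) -> float:
    """Average pairwise similarity over sampled generations;
    values near 1.0 indicate consistent (confident) output."""
    pairs = list(itertools.combinations(samples, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Identical samples score 1.0; divergent samples score much lower.
consistent = ["paris is the capital of france"] * 3
divergent = ["paris is the capital", "it is lyon", "marseille, probably"]
assert consistency_confidence(consistent) > consistency_confidence(divergent)
```

<p>The semantic variants discussed in Section 3.2 keep this same skeleton but replace the word-overlap measure with NLI-based equivalence or embedding distance.</p> <p>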
If the model is hallucinating, the generations will likely diverge.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Sampling:</strong> Generate multiple responses for the same input (or perturbed inputs).</li> <li><strong>Aggregation:</strong> Measure consistency via surface-level similarity (lexical overlap) or semantic-level similarity (clustering/embedding distance).</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>High computational cost due to multiple generation passes.</li> <li>Requires complex aggregation logic (clustering or an external judge model).</li> </ul> <h4 id="31-consistency-without-semantic-clustering-surface-level">3.1 Consistency without Semantic Clustering (Surface Level)</h4> <p>Focuses on lexical consistency or input perturbations.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/confidence-480.webp 480w,/assets/img/confidence-800.webp 800w,/assets/img/confidence-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/confidence.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="confidence" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 4: Illustration of consistency-based methods. 
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2311.08718">Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling</a> <ul> <li><em>Method:</em> Generates clarifications for potentially ambiguous inputs and ensembles the results.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2307.10236">Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models</a></li> <li><a href="https://arxiv.org/abs/2403.02509">SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models</a></li> <li><a href="https://arxiv.org/abs/2305.19187">Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models</a></li> </ul> <h4 id="32-semantic-clustering-uncertainty-deep-level">3.2 Semantic Clustering Uncertainty (Deep Level)</h4> <p>Focuses on the <em>meaning</em> of the generated text. Different phrasings with the same meaning are grouped together.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/semantic-480.webp 480w,/assets/img/semantic-800.webp 800w,/assets/img/semantic-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/semantic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="semantic" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 5: Illustration of semantic clustering methods.
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2302.09664">Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation</a> [ICLR 2023] <ul> <li><em>Method (SE):</em> Introduces Semantic Entropy, grouping generations by meaning (using NLI) before calculating entropy.</li> </ul> </li> <li><a href="https://www.nature.com/articles/s41586-024-07421-0">Detecting Hallucinations in Large Language Models Using Semantic Entropy</a> [Nature 2024] <ul> <li><em>Method (DSE):</em> A refined Dynamic Semantic Entropy approach for hallucination detection.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2406.15927">Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs</a> [ICLR 2025 Submission] <ul> <li><em>Method (SeP):</em> A probing method designed to approximate semantic entropy more efficiently.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2307.01379">Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models</a> [ICLR 2024 Submission] <ul> <li><em>Method (SAR):</em> Focuses on relevance weighting during uncertainty quantification.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2405.20003">Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities</a> [NeurIPS 2024] <ul> <li><em>Method (KLE):</em> Uses kernel methods to estimate entropy based on semantic similarity matrices.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2402.03744">INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection</a> [ICLR 2024 Poster] <ul> <li><em>Method:</em> Analyzes the eigenvalues of internal representations to detect hallucinations.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2406.04306">Semantically Diverse Language Generation for Uncertainty Estimation in Language Models</a> [ICLR 2025 Poster] <ul> <li><em>Method (SDLG):</em>
Encourages diversity during generation to better estimate the uncertainty boundary.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2506.00245">Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity</a> <ul> <li><em>Method (SNNE):</em> Enhances semantic entropy by leveraging pairwise similarity measures.</li> </ul> </li> </ul> <hr/> <h3 id="4-structural-and-graph-enhanced-methods">4. Structural and Graph-Enhanced Methods</h3> <p>A relatively novel category that abstracts uncertainty from the syntactic or structural relationships within the generated text.</p> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2509.07925">GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models</a> [EMNLP 2025] <ul> <li><em>Method:</em> Performs lexical and syntactic analysis on the generated answer to construct a graph. Uncertainty is derived from the structural properties and relations within this graph.</li> </ul> </li> </ul> <hr/> <h3 id="5-hidden-state-supervised-training-probing">5. 
Hidden State Supervised Training (Probing)</h3> <p>This approach involves training a lightweight classifier (probe) on the model’s internal hidden states (activations) to predict correctness or uncertainty.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Feature Extraction:</strong> Extracts activation vectors from specific layers.</li> <li><strong>Supervision:</strong> Trains a linear classifier (e.g., Logistic Regression or SVM) to distinguish between correct and incorrect generations.</li> </ul> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2404.15993">Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach</a> <ul> <li><em>Method:</em> Extracts features from hidden states and uses supervised learning to train a linear classifier for uncertainty prediction.</li> </ul> </li> </ul>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[A systematic taxonomy of uncertainty estimation methods for Large Language Models, categorizing key literature from token-level probabilities to semantic clustering and internal state probing.]]></summary></entry><entry><title type="html">Reproduction and Extension - Interpretable Generative Models through Post-hoc Concept Bottlenecks (CVPR 2025)</title><link href="https://shellyleee.github.io/blog/2025/IGMCBM/" rel="alternate" type="text/html" title="Reproduction and Extension - Interpretable Generative Models through Post-hoc Concept Bottlenecks (CVPR 2025)"/><published>2025-01-05T12:00:00+00:00</published><updated>2025-01-05T12:00:00+00:00</updated><id>https://shellyleee.github.io/blog/2025/IGMCBM</id><content type="html" xml:base="https://shellyleee.github.io/blog/2025/IGMCBM/"><![CDATA[<p>This blog is hosted on Medium. 
You will be redirected automatically.</p>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[A blog-style walkthrough of the CVPR 2025 paper on post-hoc concept bottleneck models and their interpretability.]]></summary></entry></feed>