<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://shellyleee.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://shellyleee.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-27T09:57:40+00:00</updated><id>https://shellyleee.github.io/feed.xml</id><title type="html">blank</title><subtitle>A simple, whitespace theme for academics. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">When Do Customers Leave? A Survival Analysis of Telco Churn Behavior</title><link href="https://shellyleee.github.io/blog/2026/survival_analysis/" rel="alternate" type="text/html" title="When Do Customers Leave? A Survival Analysis of Telco Churn Behavior"/><published>2026-04-25T21:01:00+00:00</published><updated>2026-04-25T21:01:00+00:00</updated><id>https://shellyleee.github.io/blog/2026/survival_analysis</id><content type="html" xml:base="https://shellyleee.github.io/blog/2026/survival_analysis/"><![CDATA[<div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/survival_analysis-480.webp 480w,/assets/img/survival_analysis/survival_analysis-800.webp 800w,/assets/img/survival_analysis/survival_analysis-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/survival_analysis.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="survival_analysis" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <h2 id="1-introduction">1. Introduction</h2> <p>Survival analysis is a statistical framework used to study the time until an event occurs. 
In business analytics, it is especially useful when the outcome is not only whether a customer leaves, but also how long the customer remains active before leaving.</p> <p>In this project, the survival analysis task was applied to the <a href="https://github.com/IBM/telco-customer-churn-on-icp4d/blob/master/data/Telco-Customer-Churn.csv">IBM Telco Customer Churn dataset</a>. The survival setting was defined as follows:</p> <ul> <li><strong>Event</strong>: customer churn</li> <li><strong>Duration</strong>: customer tenure, measured in months</li> <li><strong>Censoring</strong>: customers who had not churned by the time of observation</li> </ul> <p>Under this definition, customers with <code class="language-plaintext highlighter-rouge">Churn = Yes</code> are treated as observed events, while customers with <code class="language-plaintext highlighter-rouge">Churn = No</code> are treated as right-censored observations. This allows the analysis to reflect both customers who left and customers who were still retained at the end of the recorded period.</p> <hr/> <h2 id="2-methodology">2. Methodology</h2> <h3 id="21-data-preprocessing">2.1 Data preprocessing</h3> <p>The dataset was loaded from the CSV file <code class="language-plaintext highlighter-rouge">Telco-Customer-Churn.csv</code> into a standard PySpark environment using an explicit schema. 
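</p>

<p>As a rough illustration, the row-level cleaning rules described below can be sketched in plain Python (the real pipeline used PySpark with an explicit schema; the helper name and example row here are hypothetical):</p>

```python
# Sketch of the cleaning rules applied to each customer record:
# - map Yes/No churn labels to a binary event indicator
# - treat blank TotalCharges (tenure = 0 customers) as 0.0 instead of dropping
def clean_row(row):
    cleaned = dict(row)
    cleaned["churn"] = 1 if row["Churn"] == "Yes" else 0
    raw_total = row["TotalCharges"].strip()
    cleaned["TotalCharges"] = float(raw_total) if raw_total else 0.0
    return cleaned

print(clean_row({"Churn": "No", "tenure": 0, "TotalCharges": " "}))
```

<p>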
This avoided reliance on automatic type inference and made the workflow reproducible outside Databricks.</p> <p>Several preprocessing steps were performed before the survival analysis:</p> <ul> <li>The original churn field was converted from <code class="language-plaintext highlighter-rouge">Yes/No</code> into a binary event variable: <ul> <li><code class="language-plaintext highlighter-rouge">churn = 1</code> for customers who left</li> <li><code class="language-plaintext highlighter-rouge">churn = 0</code> for customers who remained</li> </ul> </li> <li>The <code class="language-plaintext highlighter-rouge">tenure</code> field was used as the survival duration and interpreted as the number of months a customer had stayed with the company.</li> <li>The <code class="language-plaintext highlighter-rouge">TotalCharges</code> field was cleaned carefully because the dataset contains blank values. These blank entries correspond to customers with <code class="language-plaintext highlighter-rouge">tenure = 0</code>, so they were handled consistently during cleaning rather than dropped blindly.</li> <li>A curated analysis table was then created for descriptive statistics, survival tables, and group comparisons.</li> </ul> <h3 id="22-survival-analysis-approach">2.2 Survival analysis approach</h3> <p>The analysis used a Kaplan-Meier style survival estimation implemented in PySpark with a discrete-time monthly approximation. 
Since <code class="language-plaintext highlighter-rouge">tenure</code> is recorded in whole months, survival probabilities were computed month by month.</p> <p>For each tenure month, the analysis calculated:</p> <ul> <li>the number of customers at risk at the beginning of the month</li> <li>the number of churn events observed in that month</li> <li>the number of censored observations in that month</li> <li>the corresponding conditional hazard</li> <li>the cumulative survival probability</li> </ul> <p>The Kaplan-Meier estimator was then approximated as:</p> \[S(t) = \prod_{i \le t}\left(1 - \frac{d_i}{n_i}\right)\] <p>where:</p> <ul> <li>$d_i$ is the number of churn events at month $i$</li> <li>$n_i$ is the number of customers at risk at the start of month $i$</li> </ul> <p>To understand heterogeneity in customer retention, separate survival curves and subgroup summaries were examined for meaningful customer characteristics, including:</p> <ul> <li><code class="language-plaintext highlighter-rouge">Contract</code></li> <li><code class="language-plaintext highlighter-rouge">SeniorCitizen</code></li> <li><code class="language-plaintext highlighter-rouge">TechSupport</code></li> </ul> <p>These variables were selected because they are business-relevant and plausibly associated with customer stability and churn risk.</p> <hr/> <h2 id="3-validation">3. Validation</h2> <p>The results were reviewed and verified iteratively.
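</p>

<p>As one concrete example of such a manual check, the early-month Kaplan-Meier values can be recomputed in plain Python from the risk-set and event counts reported in Table 2 below:</p>

```python
# Recompute S(t) = prod_{i <= t} (1 - d_i / n_i) for the first months,
# using the (month, at_risk, events) counts from Table 2.
rows = [(0, 7043, 0), (1, 7032, 380), (2, 6419, 123),
        (3, 6181, 94), (4, 5981, 83)]

survival, s = {}, 1.0
for month, at_risk, events in rows:
    s *= 1.0 - events / at_risk   # conditional hazard d_i / n_i
    survival[month] = round(s, 3)

print(survival)  # matches the reported values, e.g. S(1) ≈ 0.946, S(4) ≈ 0.901
```

<p>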
Rather than accepting the first implementation mechanically, the outputs were examined carefully for internal consistency and business plausibility.</p> <p>The validation process included the following checks:</p> <ul> <li>survival tables were compared against the raw churn data</li> <li>early-month Kaplan-Meier calculations were checked manually for consistency</li> <li>final survival probabilities were cross-checked across exported result tables</li> <li>subgroup results were reviewed to confirm that the direction of effects matched reasonable expectations</li> </ul> <p>The overall churn counts, monthly risk sets, event counts, and survival probabilities were all found to be consistent with the underlying dataset.</p> <hr/> <h2 id="4-results">4. Results</h2> <h3 id="41-churn-distribution">4.1 Churn distribution</h3> <p>The curated dataset contained <strong>7,043</strong> customers in total.</p> <p>The churn distribution was:</p> <ul> <li><strong>No churn / censored</strong>: 5,174 customers (<strong>73.46%</strong>)</li> <li><strong>Churn event</strong>: 1,869 customers (<strong>26.54%</strong>)</li> </ul> <p>This indicates that most customers were retained at the observation endpoint, but a substantial minority had churned.</p> <p>Table 1 summarizes the observed churn distribution in the curated dataset.</p> <div class="table-responsive"> <table class="table table-hover table-sm"> <thead> <tr> <th>Churn status</th> <th class="text-end">Customers</th> <th class="text-end">Percentage</th> </tr> </thead> <tbody> <tr> <td>No churn / censored</td> <td class="text-end">5,174</td> <td class="text-end">73.46%</td> </tr> <tr> <td>Churn event</td> <td class="text-end">1,869</td> <td class="text-end">26.54%</td> </tr> </tbody> </table> </div> <div class="caption"> Table 1. 
Churn distribution in the curated Telco customer dataset </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/churn_distribution-480.webp 480w,/assets/img/survival_analysis/churn_distribution-800.webp 800w,/assets/img/survival_analysis/churn_distribution-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/churn_distribution.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="churn_distribution" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 1. Churn distribution bar chart </div> <h3 id="42-overall-kaplan-meier-survival-behavior">4.2 Overall Kaplan-Meier survival behavior</h3> <p>The overall Kaplan-Meier curve showed a clear decline in survival probability over time, with the steepest drop occurring in the earliest months of customer tenure.</p> <p>The first few months illustrate this pattern clearly:</p> <ul> <li>At month 0, survival began at 1.000</li> <li>At month 1, survival dropped to about <strong>0.946</strong></li> <li>At month 2, survival dropped further to about <strong>0.928</strong></li> <li>At month 4, survival was about <strong>0.901</strong></li> </ul> <p>By the end of the observation window, the estimated survival probability at month 72 was approximately <strong>0.593</strong>.</p> <p>This pattern suggests that churn risk is highest early in the customer lifecycle, after which the remaining customer base becomes relatively more stable.</p> <p>Table 2 shows a representative excerpt from the monthly survival table used to compute the Kaplan-Meier curve.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Tenure month</th> <th class="text-end">At risk</th> <th class="text-end">Events</th> <th 
class="text-end">Censored</th> <th class="text-end">Hazard</th> </tr> </thead> <tbody> <tr><td>0</td><td class="text-end">7,043</td><td class="text-end">0</td><td class="text-end">11</td><td class="text-end">0.0000</td></tr> <tr><td>1</td><td class="text-end">7,032</td><td class="text-end">380</td><td class="text-end">233</td><td class="text-end">0.0540</td></tr> <tr><td>2</td><td class="text-end">6,419</td><td class="text-end">123</td><td class="text-end">115</td><td class="text-end">0.0192</td></tr> <tr><td>3</td><td class="text-end">6,181</td><td class="text-end">94</td><td class="text-end">106</td><td class="text-end">0.0152</td></tr> <tr><td>4</td><td class="text-end">5,981</td><td class="text-end">83</td><td class="text-end">93</td><td class="text-end">0.0139</td></tr> <tr><td>5</td><td class="text-end">5,805</td><td class="text-end">64</td><td class="text-end">69</td><td class="text-end">0.0110</td></tr> <tr><td>6</td><td class="text-end">5,672</td><td class="text-end">40</td><td class="text-end">70</td><td class="text-end">0.0071</td></tr> <tr><td>7</td><td class="text-end">5,562</td><td class="text-end">51</td><td class="text-end">80</td><td class="text-end">0.0092</td></tr> <tr><td>8</td><td class="text-end">5,431</td><td class="text-end">42</td><td class="text-end">81</td><td class="text-end">0.0077</td></tr> <tr><td>9</td><td class="text-end">5,308</td><td class="text-end">46</td><td class="text-end">73</td><td class="text-end">0.0087</td></tr> </tbody> </table> </div> <div class="caption"> Table 2. 
Representative rows from the overall monthly survival table </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/km_curve-480.webp 480w,/assets/img/survival_analysis/km_curve-800.webp 800w,/assets/img/survival_analysis/km_curve-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/km_curve.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="km_curve" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 2. Overall Kaplan-Meier survival curve </div> <h3 id="43-descriptive-differences-by-churn-status">4.3 Descriptive differences by churn status</h3> <p>Customers who churned had systematically shorter tenure and higher monthly charges than those who stayed:</p> <ul> <li><strong>Non-churned customers</strong> <ul> <li>mean tenure: <strong>37.57 months</strong></li> <li>mean monthly charges: <strong>61.27</strong></li> <li>mean total charges: <strong>2549.91</strong></li> </ul> </li> <li><strong>Churned customers</strong> <ul> <li>mean tenure: <strong>17.98 months</strong></li> <li>mean monthly charges: <strong>74.44</strong></li> <li>mean total charges: <strong>1531.80</strong></li> </ul> </li> </ul> <p>These differences are consistent with the survival analysis results. 
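</p>

<p>The group summaries in Table 3 amount to per-group means and standard deviations; a toy pure-Python sketch of that computation (the records below are made up, not the real data, which was aggregated in PySpark over all 7,043 customers):</p>

```python
from statistics import mean, stdev

# Hypothetical mini-sample of customers: (churn_flag, tenure, monthly_charges)
customers = [(0, 60, 50.0), (0, 30, 70.0), (0, 45, 55.0),
             (1, 5, 85.0), (1, 20, 75.0)]

for flag, label in [(0, "no churn"), (1, "churn")]:
    tenures = [t for c, t, _ in customers if c == flag]
    charges = [m for c, _, m in customers if c == flag]
    # mean/SD tenure and mean monthly charges for this group
    print(label, round(mean(tenures), 2), round(stdev(tenures), 2),
          round(mean(charges), 2))
```

<p>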
Customers who leave tend to do so earlier, and they also tend to pay higher monthly rates.</p> <p>Table 3 summarizes the main descriptive differences between churned and non-churned customers.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Churn group</th> <th class="text-end">Customers</th> <th class="text-end">Mean tenure (months)</th> <th class="text-end">SD tenure</th> <th class="text-end">Mean monthly charges</th> <th class="text-end">SD monthly charges</th> <th class="text-end">Mean total charges</th> <th class="text-end">SD total charges</th> </tr> </thead> <tbody> <tr> <td>No churn / censored</td> <td class="text-end">5,174</td> <td class="text-end">37.57</td> <td class="text-end">24.11</td> <td class="text-end">61.27</td> <td class="text-end">31.09</td> <td class="text-end">2549.91</td> <td class="text-end">2329.95</td> </tr> <tr> <td>Churn event</td> <td class="text-end">1,869</td> <td class="text-end">17.98</td> <td class="text-end">19.53</td> <td class="text-end">74.44</td> <td class="text-end">24.67</td> <td class="text-end">1531.80</td> <td class="text-end">1890.82</td> </tr> </tbody> </table> </div> <div class="caption"> Table 3. Descriptive statistics by churn status </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/tenure_distribution_by_churn-480.webp 480w,/assets/img/survival_analysis/tenure_distribution_by_churn-800.webp 800w,/assets/img/survival_analysis/tenure_distribution_by_churn-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/tenure_distribution_by_churn.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="tenure_distribution_by_churn" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 3. 
Tenure distribution by churn status </div> <h3 id="44-group-comparison-contract-type">4.4 Group comparison: Contract type</h3> <p>Contract type showed the strongest and clearest survival separation.</p> <p>The estimated final survival probabilities were approximately:</p> <ul> <li><strong>Month-to-month</strong>: <strong>0.129</strong></li> <li><strong>One year</strong>: <strong>0.568</strong></li> <li><strong>Two year</strong>: <strong>0.936</strong></li> </ul> <p>This pattern is highly interpretable. Customers on month-to-month plans experience much lower retention over time, while two-year contract customers are substantially more stable. The month-to-month curve declines sharply in the early periods, indicating strong early churn risk.</p> <p>Table 4 reports the final Kaplan-Meier survival probabilities by contract type.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Contract type</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>Month-to-month</td> <td class="text-end">72</td> <td class="text-end">0.1290</td> </tr> <tr> <td>One year</td> <td class="text-end">72</td> <td class="text-end">0.5681</td> </tr> <tr> <td>Two year</td> <td class="text-end">72</td> <td class="text-end">0.9357</td> </tr> </tbody> </table> </div> <div class="caption"> Table 4. 
Final survival probabilities by contract type </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/contract_survival-480.webp 480w,/assets/img/survival_analysis/contract_survival-800.webp 800w,/assets/img/survival_analysis/contract_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/contract_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="contract_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 4. Survival curves grouped by contract type </div> <h3 id="45-group-comparison-seniorcitizen">4.5 Group comparison: SeniorCitizen</h3> <p>Senior status also showed a meaningful difference.</p> <p>The estimated final survival probabilities were approximately:</p> <ul> <li><strong>Non-senior customers</strong>: <strong>0.634</strong></li> <li><strong>Senior customers</strong>: <strong>0.421</strong></li> </ul> <p>This suggests that senior customers in this dataset experienced higher churn risk and lower long-term survival than non-senior customers.</p> <p>Table 5 reports the final Kaplan-Meier survival probabilities by senior citizen status.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>Senior citizen group</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>Non-senior customers</td> <td class="text-end">72</td> <td class="text-end">0.6339</td> </tr> <tr> <td>Senior customers</td> <td class="text-end">72</td> <td class="text-end">0.4213</td> </tr> </tbody> </table> </div> <div class="caption"> Table 5. Final Kaplan–Meier survival probabilities by senior citizen status. 
</div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/senior_survival-480.webp 480w,/assets/img/survival_analysis/senior_survival-800.webp 800w,/assets/img/survival_analysis/senior_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/senior_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="senior_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 5. Survival curves grouped by senior citizen status </div> <h3 id="46-group-comparison-techsupport">4.6 Group comparison: TechSupport</h3> <p>Where subgroup summaries were available, <code class="language-plaintext highlighter-rouge">TechSupport</code> also appeared to be associated with retention differences.</p> <p>The observed pattern was logically consistent:</p> <ul> <li>customers with <strong>no tech support</strong> had the weakest retention</li> <li>customers with <strong>tech support</strong> showed better retention</li> <li>customers with <strong>no internet service</strong> appeared most stable</li> </ul> <p>This pattern is plausible from a business perspective. Customers lacking technical support may be more vulnerable to dissatisfaction or service friction, increasing churn risk. 
By contrast, customers with support or with simpler service configurations tend to be more stable.</p> <p>Table 6 reports the final Kaplan-Meier survival probabilities by <code class="language-plaintext highlighter-rouge">TechSupport</code> status.</p> <div class="table-responsive rounded z-depth-1 p-2 mt-3 mb-2"> <table class="table table-hover table-sm mb-0"> <thead> <tr> <th>TechSupport group</th> <th class="text-end">Final tenure month</th> <th class="text-end">Final survival probability</th> </tr> </thead> <tbody> <tr> <td>No</td> <td class="text-end">72</td> <td class="text-end">0.3492</td> </tr> <tr> <td>No internet service</td> <td class="text-end">72</td> <td class="text-end">0.9015</td> </tr> <tr> <td>Yes</td> <td class="text-end">72</td> <td class="text-end">0.7608</td> </tr> </tbody> </table> </div> <div class="caption"> Table 6. Final Kaplan–Meier survival probabilities by TechSupport status. </div> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/survival_analysis/techsupport_survival-480.webp 480w,/assets/img/survival_analysis/techsupport_survival-800.webp 800w,/assets/img/survival_analysis/techsupport_survival-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/survival_analysis/techsupport_survival.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="techsupport_survival" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 6. Kaplan-Meier survival curves grouped by TechSupport status </div> <p>At month 72, customers without tech support had the lowest estimated survival probability at <strong>0.3492</strong>, while customers with tech support reached <strong>0.7608</strong>. 
The highest survival probability, <strong>0.9015</strong>, was observed for customers with no internet service.</p> <p>These output values are consistent with the plotted curves and suggest that access to technical support is associated with stronger customer retention in this dataset.</p> <hr/> <h2 id="5-insights">5. Insights</h2> <p>Several factors appear to influence churn risk strongly.</p> <p>First, <strong>contract type</strong> is the dominant factor in the survival analysis. Month-to-month customers have much lower survival than customers with one-year or two-year contracts. This suggests that commitment structure is closely tied to retention.</p> <p>Second, <strong>customer age category</strong>, represented here by <code class="language-plaintext highlighter-rouge">SeniorCitizen</code>, is associated with lower survival among senior customers. While this does not explain why the difference exists, it signals that this segment may require more targeted retention support.</p> <p>Third, <strong>service support conditions</strong> matter. Customers without technical support appear more vulnerable to churn, which may reflect unresolved service issues, weaker engagement, or lower perceived value.</p> <p>From a business perspective, the results imply that churn risk is concentrated among customers with:</p> <ul> <li>short tenure</li> <li>flexible month-to-month contracts</li> <li>weaker service support conditions</li> <li>higher monthly charges</li> </ul> <p>These findings suggest several practical retention strategies:</p> <ul> <li>prioritize onboarding and retention efforts in the first few months</li> <li>design incentives for migration from month-to-month to longer contracts</li> <li>expand proactive support for at-risk service groups</li> <li>monitor high-bill customers for dissatisfaction signals</li> </ul> <hr/> <h2 id="6-limitations">6. 
Limitations</h2> <p>This analysis has several important limitations.</p> <p>First, the Kaplan-Meier procedure was implemented as a <strong>discrete monthly approximation</strong> because tenure is recorded in whole months. This is appropriate for the dataset, but it is still a grouped version of survival analysis rather than a fully continuous-time formulation.</p> <p>Second, the analysis did <strong>not include formal statistical hypothesis testing</strong>, such as a log-rank test. As a result, the subgroup differences can be described and visualized, but not formally tested for statistical significance within this notebook.</p> <p>Third, the dataset is <strong>observational</strong>. The results identify associations between customer characteristics and retention patterns, but they should not be interpreted as causal effects.</p> <p>Finally, some subgroup conclusions depend on descriptive survival comparisons rather than multivariable modeling. This means that confounding between factors may still be present.</p> <hr/> <h2 id="7-conclusion">7. Conclusion</h2> <p>This survival analysis shows that the IBM Telco Customer Churn dataset contains strong and interpretable retention patterns when churn is treated as the event and tenure is treated as the survival duration.</p> <p>The results indicate that churn risk is highest early in the customer lifecycle and that survival differs substantially across customer groups. In particular:</p> <ul> <li>month-to-month contracts are associated with the lowest survival</li> <li>two-year contracts are associated with the strongest retention</li> <li>senior customers show lower survival than non-senior customers</li> <li>technical support status is also related to customer stability</li> </ul> <p>Overall, the Kaplan-Meier analysis provides a clear and credible view of customer retention dynamics. 
Even without a parametric survival model, the descriptive survival framework offers useful evidence for understanding churn timing, identifying high-risk groups, and informing retention strategy.</p>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[Explore customer churn through a survival analysis perspective using the IBM Telco dataset. By treating churn as a time-to-event problem, we uncover how contract types, technical support, and customer characteristics influence retention over time.]]></summary></entry><entry><title type="html">Uncertainty Estimation Methods in Large Language Models - A Taxonomy</title><link href="https://shellyleee.github.io/blog/2025/UQ/" rel="alternate" type="text/html" title="Uncertainty Estimation Methods in Large Language Models - A Taxonomy"/><published>2025-09-19T21:01:00+00:00</published><updated>2025-09-19T21:01:00+00:00</updated><id>https://shellyleee.github.io/blog/2025/UQ</id><content type="html" xml:base="https://shellyleee.github.io/blog/2025/UQ/"><![CDATA[<h3 id="introduction">Introduction</h3> <p>Reliability and trustworthiness are paramount challenges in the deployment of Large Language Models (LLMs). A critical component of reliable AI is <strong>Uncertainty Estimation</strong>, which aims to quantify the model’s confidence in its own generations. 
This post provides a systematic taxonomy of current uncertainty estimation methods, synthesizing key literature including recent surveys such as <a href="https://arxiv.org/abs/2412.05563">A Survey on Uncertainty Quantification of Large Language Models: Taxonomy, Open Research Challenges, and Future Directions</a> and <a href="https://arxiv.org/abs/2503.00172">A Survey of Uncertainty Estimation Methods on Large Language Models</a>.</p> <p>The methods are categorized into five distinct approaches: Token Probability, Verbalized Confidence, Consistency/Ensemble-based, Structural/Graph-based, and Hidden State Probing.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/UQ-480.webp 480w,/assets/img/UQ-800.webp 800w,/assets/img/UQ-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/UQ.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="UQ image" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 1: Five distinct approaches - Token Probability, Verbalized Confidence, Consistency/Ensemble-based, Structural/Graph-based, and Hidden State Probing. </div> <hr/> <h3 id="1-logit-based-and-probability-derived-methods">1. Logit-Based and Probability-Derived Methods</h3> <p>This category utilizes the internal probability distribution of the model (white-box access) to derive uncertainty scores.
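</p>

<p>A minimal sketch of the probability-derived scores this category builds on, using made-up per-token probabilities rather than real model logits:</p>

```python
import math

# Hypothetical per-token probabilities for one generated answer.
token_probs = [0.91, 0.85, 0.60, 0.95]

# Sequence-level scores: average probability, and average negative
# log-likelihood (a length-normalized variant of the joint log-prob).
avg_prob = sum(token_probs) / len(token_probs)
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Entropy of a single (toy) next-token distribution: higher = more uncertain.
dist = [0.7, 0.2, 0.1]
entropy = -sum(p * math.log(p) for p in dist)

print(round(avg_prob, 3), round(avg_nll, 3), round(entropy, 3))
```

<p>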
These methods rely on the output logits of tokens or the entire sequence.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Metrics:</strong> Calculation of statistics such as Average/Max Probability or Average/Max Entropy across the generated token sequence.</li> <li><strong>Validation:</strong> Content correctness is often validated using similarity metrics (ExactMatch, BLEU, RougeL, Jaccard Index, BERTScore, Cosine Similarity) where a threshold (e.g., score &gt; 0.5) implies correctness.</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>Requires white-box access to the model (access to logits).</li> <li>Suffers from miscalibration (models are often confident but wrong).</li> </ul> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/latent-480.webp 480w,/assets/img/latent-800.webp 800w,/assets/img/latent-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/latent.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="latent" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 2: Illustration of latent information methods. (Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2012.00955">How Can We Know When Language Models Know?
On the Calibration of Language Models for Question Answering</a> <ul> <li><em>Method:</em> Uses confidence scores derived from the average/max negative log-likelihood probability and the entropy of response tokens.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2207.05221">Language Models (Mostly) Know What They Know</a> <ul> <li><em>Method:</em> Evaluates the probability of the statement being true, denoted as $P(\text{true})$.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2002.07650">Uncertainty Estimation in Autoregressive Structured Prediction</a> <ul> <li><em>Method:</em> Introduces Length-Normalized (LN) methods to mitigate the bias where longer sequences yield lower joint probabilities.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2402.11756">MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs</a> <ul> <li><em>Method:</em> Improves upon standard length normalization by assigning weights to each token via BERT embeddings to capture semantic importance.</li> </ul> </li> </ul> <hr/> <h3 id="2-verbalized-confidence-prompt-based">2. 
Verbalized Confidence (Prompt-Based)</h3> <p>These methods treat the LLM as a black box, leveraging prompt engineering to explicitly query the model for its confidence level or to generate reasoning paths (Chain-of-Thought) regarding its certainty.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Direct Query:</strong> Prompting the model to output a numerical score or a linguistic confidence marker along with the answer.</li> <li><strong>Framework:</strong> Often involves multi-stage prompting (Answer generation $\rightarrow$ Confidence elicitation).</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>Performance is heavily dependent on prompt design.</li> <li>It is challenging to improve the separation between correct and incorrect predictions solely through prompting without fine-tuning.</li> </ul> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/verbalize-480.webp 480w,/assets/img/verbalize-800.webp 800w,/assets/img/verbalize-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/verbalize.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="verbalize" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 3: Illustration of verbalized confidence methods.
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2205.14334">Teaching Models to Express Their Uncertainty in Words</a> <ul> <li><em>Method:</em> GPT-3 is prompted to generate both the answer and a verbalized confidence level simultaneously.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2305.14975">Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback</a> <ul> <li><em>Method:</em> Explores various strategies including Label Probability, “IsTrue” Probability, and Verbalized methods (1-step vs. 2-step, Top-K, CoT).</li> </ul> </li> <li><a href="https://arxiv.org/abs/2306.13063">Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs</a> <ul> <li><em>Method:</em> Proposes a 3-stage framework for black-box LLMs: (1) Prompting for confidence, (2) Sampling diverse responses, (3) Aggregation (ranking/averaging).</li> </ul> </li> <li><a href="https://openreview.net/forum?id=Yd2S8flZKm">Quantifying Uncertainty in Natural Language Explanations of Large Language Models</a> <ul> <li><em>Method:</em> Measures uncertainty in reasoning steps. Uses <em>Token Importance Scoring</em> (via sample probing) and <em>Step-wise CoT Confidence</em> (via model probing).</li> </ul> </li> </ul> <hr/> <h3 id="3-consistency-and-ensemble-based-methods">3. Consistency and Ensemble-Based Methods</h3> <p>This approach is based on the intuition that if a model is confident, multiple sampled generations should be consistent.
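</p> <p>As an illustrative sketch of this intuition (not the exact scoring rule of any specific paper below), one can sample several generations and take the average pairwise lexical overlap as a confidence score; here plain word-level Jaccard similarity stands in for whatever similarity measure a given method uses:</p>

```python
import itertools

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two responses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def consistency_confidence(samples: list[str]) -> float:
    """Average pairwise similarity over sampled generations;
    values near 1.0 indicate consistent (confident) output."""
    pairs = list(itertools.combinations(samples, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Identical samples score 1.0; divergent samples score much lower.
consistent = ["paris is the capital of france"] * 3
divergent = ["paris is the capital", "it is lyon", "marseille, probably"]
assert consistency_confidence(consistent) > consistency_confidence(divergent)
```

<p>The semantic variants discussed in Section 3.2 keep this same skeleton but replace the word-overlap measure with NLI-based equivalence or embedding distance.</p> <p>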
If the model is hallucinating, the generations will likely diverge.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Sampling:</strong> Generate multiple responses for the same input (or perturbed inputs).</li> <li><strong>Aggregation:</strong> Measure consistency via surface-level similarity (lexical overlap) or semantic-level similarity (clustering/embedding distance).</li> </ul> <p><strong>Limitations:</strong></p> <ul> <li>High computational cost due to multiple generation passes.</li> <li>Requires complex aggregation logic (clustering or an external judge model).</li> </ul> <h4 id="31-consistency-without-semantic-clustering-surface-level">3.1 Consistency without Semantic Clustering (Surface Level)</h4> <p>Focuses on lexical consistency or input perturbations.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/confidence-480.webp 480w,/assets/img/confidence-800.webp 800w,/assets/img/confidence-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/confidence.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="confidence" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 4: Illustration of consistency-based methods. 
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2311.08718">Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling</a> <ul> <li><em>Method:</em> Generates clarifications for potentially ambiguous inputs and ensembles the results.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2307.10236">Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models</a></li> <li><a href="https://arxiv.org/abs/2403.02509">SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models</a></li> <li><a href="https://arxiv.org/abs/2305.19187">Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models</a></li> </ul> <h4 id="32-semantic-clustering-uncertainty-deep-level">3.2 Semantic Clustering Uncertainty (Deep Level)</h4> <p>Focuses on the <em>meaning</em> of the generated text. Different phrasings with the same meaning are grouped together.</p> <div class="row"> <div class="col-sm mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/semantic-480.webp 480w,/assets/img/semantic-800.webp 800w,/assets/img/semantic-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/semantic.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="semantic" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 5: Illustration of semantic clustering methods.
(Xia et al., 2025) </div> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2302.09664">Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation</a> [ICLR 2023] <ul> <li><em>Method (SE):</em> Introduces Semantic Entropy, grouping generations by meaning (using NLI) before calculating entropy.</li> </ul> </li> <li><a href="https://www.nature.com/articles/s41586-024-07421-0">Detecting Hallucinations in Large Language Models Using Semantic Entropy</a> [Nature 2024] <ul> <li><em>Method (DSE):</em> A refined Dynamic Semantic Entropy approach for hallucination detection.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2406.15927">Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs</a> [ICLR 2025 Submission] <ul> <li><em>Method (SeP):</em> A probing method designed to approximate semantic entropy more efficiently.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2307.01379">Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models</a> [ICLR 2024 Submission] <ul> <li><em>Method (SAR):</em> Focuses on relevance weighting during uncertainty quantification.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2405.20003">Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities</a> [NeurIPS 2024] <ul> <li><em>Method (KLE):</em> Uses kernel methods to estimate entropy based on semantic similarity matrices.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2402.03744">INSIDE: LLMs’ Internal States Retain the Power of Hallucination Detection</a> [ICLR 2024 Poster] <ul> <li><em>Method:</em> Analyzes the eigenvalues of internal representations to detect hallucinations.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2406.04306">Semantically Diverse Language Generation for Uncertainty Estimation in Language Models</a> [ICLR 2025 Poster] <ul> <li><em>Method (SDLG):</em>
Encourages diversity during generation to better estimate the uncertainty boundary.</li> </ul> </li> <li><a href="https://arxiv.org/abs/2506.00245">Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity</a> <ul> <li><em>Method (SNNE):</em> Enhances semantic entropy by leveraging pairwise similarity measures.</li> </ul> </li> </ul> <hr/> <h3 id="4-structural-and-graph-enhanced-methods">4. Structural and Graph-Enhanced Methods</h3> <p>A relatively novel category that abstracts uncertainty from the syntactic or structural relationships within the generated text.</p> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2509.07925">GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models</a> [EMNLP 2025] <ul> <li><em>Method:</em> Performs lexical and syntactic analysis on the generated answer to construct a graph. Uncertainty is derived from the structural properties and relations within this graph.</li> </ul> </li> </ul> <hr/> <h3 id="5-hidden-state-supervised-training-probing">5. 
Hidden State Supervised Training (Probing)</h3> <p>This approach involves training a lightweight classifier (probe) on the model’s internal hidden states (activations) to predict correctness or uncertainty.</p> <p><strong>Mechanism:</strong></p> <ul> <li><strong>Feature Extraction:</strong> Extracts activation vectors from specific layers.</li> <li><strong>Supervision:</strong> Trains a linear classifier (e.g., Logistic Regression or SVM) to distinguish between correct and incorrect generations.</li> </ul> <p><strong>Key Literature:</strong></p> <ul> <li><a href="https://arxiv.org/abs/2404.15993">Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach</a> <ul> <li><em>Method:</em> Extracts features from hidden states and uses supervised learning to train a linear classifier for uncertainty prediction.</li> </ul> </li> </ul>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[A systematic taxonomy of uncertainty estimation methods for Large Language Models, categorizing key literature from token-level probabilities to semantic clustering and internal state probing.]]></summary></entry><entry><title type="html">Reproduction and Extension - Interpretable Generative Models through Post-hoc Concept Bottlenecks (CVPR 2025)</title><link href="https://shellyleee.github.io/blog/2025/IGMCBM/" rel="alternate" type="text/html" title="Reproduction and Extension - Interpretable Generative Models through Post-hoc Concept Bottlenecks (CVPR 2025)"/><published>2025-01-05T12:00:00+00:00</published><updated>2025-01-05T12:00:00+00:00</updated><id>https://shellyleee.github.io/blog/2025/IGMCBM</id><content type="html" xml:base="https://shellyleee.github.io/blog/2025/IGMCBM/"><![CDATA[<p>This blog is hosted on Medium. 
You will be redirected automatically.</p>]]></content><author><name></name></author><category term="research-blog"/><category term="research"/><summary type="html"><![CDATA[A blog-style walkthrough of the CVPR 2025 paper on post-hoc concept bottleneck models and their interpretability.]]></summary></entry></feed>