Failure analysis for improved yield

Previously published in Semiconductor Magazine, January 2003.

Failure analysis was once limited to chips for military and aerospace applications, which often have extensive testing and certification requirements. As general-purpose chips become more valuable and the cost of field returns increases, more chip and system manufacturers are turning to failure analysis as the last line of defense against poor performance. According to Howard Dicken of the analysis firm DM Data Inc., a properly conducted failure analysis can save hundreds of thousands of dollars in field returns, lost production, and customer goodwill.

Despite the best efforts of process engineers, no fab ever achieves one hundred percent yield. Some chips will contain defects. Some of those defects will escape notice during inline inspections and reach the end of the process line. Some defective chips will even slip through electrical test screens and reach customers.

Some of these defective chips escape because the inspection and test regime simply didn't catch them. Often, products are released to production while the manufacturing process is still relatively immature. While mature products may experience only one or two hundred defects per million parts, FEI's Craig Bennett explained, newer products may have defect levels in the thousands per million parts. Fabs are willing to tolerate less than optimum yield and accept the increased risk of field failures in order to meet a particular market window.

Chips that fail at electrical test or at the customer's incoming inspection represent yield loss. That is, they never worked and only avoided being caught sooner through a sampling quirk. A second class of chips worked at one time, having passed electrical test and been soldered onto a circuit board, but then ceased to work for an unknown reason. Though failure analysis labs deal with both groups of chips, the two groups generally present different failure mechanisms. Chips that never worked succumb to all the process defects that inspection and test screens are supposed to catch, such as contamination, unetched vias, and incomplete wire deposition. Particularly in immature processes, chips can also fail because they lie outside the acceptable process margin. For example, a circuit might be able to tolerate 40% unlanded vias, but fail at 50%. A temporary process fluctuation that increases the percentage of unlanded vias by only a small amount can lead to yield loss. As smaller feature sizes reduce process windows, failures due to process marginality become more common.
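The effect of a narrow process margin can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the fraction of unlanded vias varies from chip to chip according to a normal distribution and that a chip fails once that fraction reaches 50%, as in the example above. The means and the standard deviation are hypothetical values, not measured data.

```python
from statistics import NormalDist

FAIL_THRESHOLD = 0.50  # from the example above: the circuit fails at 50% unlanded vias


def failing_fraction(mean, sigma, threshold=FAIL_THRESHOLD):
    """Fraction of chips whose unlanded-via fraction reaches the threshold.

    Assumes a normal chip-to-chip distribution (an illustrative assumption).
    """
    return 1.0 - NormalDist(mu=mean, sigma=sigma).cdf(threshold)


# Hypothetical process: nominal mean of 40% unlanded vias, 3% standard deviation.
nominal = failing_fraction(mean=0.40, sigma=0.03)
# The same hypothetical process after a temporary drift of the mean to 44%.
drifted = failing_fraction(mean=0.44, sigma=0.03)

print(f"nominal: {nominal * 1e6:.0f} failing chips per million")
print(f"drifted: {drifted * 1e6:.0f} failing chips per million")
```

Under these assumed numbers, a drift of only four percentage points in the mean raises the failing fraction by more than an order of magnitude, which is the sense in which a small fluctuation can translate into significant yield loss.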

Why good chips go bad

Failure analysts should determine both the specific failure mechanism and the underlying cause. Only by addressing both questions, Dicken said, can the analysis help prevent additional failures. For example, electrostatic discharge (ESD) and electrical overstress (EOS) are blamed for up to 60% of all field failures. ESD failures are usually due to a brief, high-intensity static discharge, often caused when an improperly grounded human or machine handles the chip. EOS failures typically involve long-term, low-intensity stress such as might arise if a part consistently ran at a higher voltage or clock speed than intended. Improved handling or increased surge protection can help prevent ESD failures. EOS failures might indicate that the part's specifications are not appropriate for the application. Without understanding the root cause, it's impossible to address the problem.

ESD prevention is especially difficult with RF chips, for which two of the pins must serve as antennae to receive signals. It's difficult to isolate the chip from ESD while still allowing it to receive RF signals. As microprocessor clock speeds increase, MPUs take on some characteristics of RF chips and require better ESD isolation.

According to Dicken, contamination is the next leading cause of circuit failure, accounting for 5-10% of all failures. Some contamination takes place either during wafer fabrication or at the packaging plant. Environmental contamination can also occur after the chip is packaged. All plastic molding compounds absorb water to some extent. This moisture can reach the die and cause corrosion or ionic contamination. If a package delamination or void already exists, moisture accumulates there. Rapid evaporation in the reflow oven can cause popcorn cracking of the package or die. Sonoscan president Larry Kessler noted that popcorn cracking is more common with surface mounts than with through-hole mounts because the surface mount reflow oven heats the entire part.

Thermal expansion mismatch between the silicon die and plastic packaging components is one important cause of package delamination. Delamination can cause additional failures if the silicon expands but the delaminated layers do not. As the silicon expands and contracts, wire bonds might be stretched or pulled, work-hardening or breaking the bonds. Dicken estimates that mechanical problems account for about 5% of failures. As bond pitches become smaller and leadframes give way to ball grid arrays, molding and underfill compounds must flow through smaller spaces. Packaging issues related to these materials are becoming more common.

The electrical characteristics of the chip, the circuit board, or the application are responsible for between five and ten percent of failures. The part might be placed too close to other components on the same board, leading to latch-up. Dicken explained that latch-up is becoming a more serious problem as shrinking feature sizes make parts more sensitive to signal integrity faults.

Between five and ten percent of failures are due to wear-out mechanisms like metal migration or oxide breakdown. Still, such failures might not occur in the absence of elevated temperature or other unusual stress. For example, high current density can cause electromigration. The high current density might result when a particular application exercises a given signal line continuously for an extended period. Under more moderate stress, chips in space and power plant installations have kept working for thirty years or more.
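The dependence of electromigration lifetime on current density and temperature is commonly modeled with Black's equation, MTTF = A · J^(-n) · exp(Ea / kT). The sketch below uses it to compare a stressed operating point with a reference one. The exponent n = 2, the activation energy of 0.7 eV, and both operating points are illustrative assumptions, not values from this article.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_lifetime(j_ratio, temp_ref_c, temp_stress_c, n=2.0, ea_ev=0.7):
    """Ratio MTTF_stress / MTTF_ref from Black's equation.

    j_ratio is J_stress / J_ref. The prefactor A cancels in the ratio.
    n and ea_ev are assumed values, not measured ones.
    """
    t_ref = temp_ref_c + 273.15
    t_stress = temp_stress_c + 273.15
    current_term = j_ratio ** (-n)
    thermal_term = math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_stress - 1.0 / t_ref))
    return current_term * thermal_term


# Hypothetical case: a line carrying twice its intended current density,
# running at 105 C instead of 85 C.
ratio = relative_lifetime(j_ratio=2.0, temp_ref_c=85.0, temp_stress_c=105.0)
print(f"expected lifetime is about {ratio:.2f}x the reference lifetime")
```

With these assumed parameters the stressed line is expected to last less than a tenth as long as the reference line, which is consistent with the observation that the same chip can survive for decades under moderate stress.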

Though a chip may already be defective, often environmental and design factors contribute to the actual failure. Temperature and moisture create a hostile environment, exacerbating any existing problems. Often, the design of the system or application may exercise the chip in unexpected ways. Tom Paquette, president of Insight Analytical Labs, observed that via instability as the part warms up is a common problem.

Deconstructing chip failures

Regardless of whether a particular chip represents a yield issue, an infant mortality failure, or an end-of-life wear-out failure, the analytical procedure is the same. First, the failure analysis lab must reproduce the problem. In some cases, the chip might begin working again after removal from the offending circuit board. The problem could have been a cold solder joint at the board, rather than a problem with the chip itself.

Hard failures, where the chip consistently faults on a particular signal, are relatively easy to detect and identify. Soft failures are more difficult. The chip may fail only at a particular frequency, or only at elevated temperatures. The chip might behave properly under test conditions, but one particular application might exercise a circuit block that just happens to contain vias with higher than normal resistance. The excess resistance could cause intermittent timing faults or excess heating in that particular area of the circuit. Any of these faults can be difficult to reproduce without a clear description of the problem.

Once the failure lab reproduces the problem, it tries to gather as much information as possible by non-destructive means. Sometimes the package itself may show signs of damage. A plastic part could be melted or scorched. A hermetic package could leak. Even if the failure seems obvious, the lab must clearly establish the cause-and-effect relationship. A part could have several different defects, only one of which caused the failure. Several defects might stem from the same underlying cause. Side-by-side comparison with known good chips can help establish which defects are actually relevant to the problem.

Tools like high-resolution x-ray and scanning acoustic microscopy can help analysts see inside the package. X-ray methods show the shape of wire bonds or solder bumps. For instance, a wire bond could break or pull loose because of uneven flow during molding. Solder bumps can deform.

Kessler explained that molding compound, die attach adhesive, and similar materials offer little resistance to x-ray penetration. Air gaps do not absorb x-rays at all. Such problems are easier to identify by scanning acoustic microscopy. Sound waves travel at different speeds in different kinds of materials. The frequencies typically used for acoustic microscopy travel well in ceramics and glasses, and moderately well in plastic materials. They do not propagate in air at all, clearly indicating delamination, voids, die cracks, and similar package defects. Acoustic microscopy can detect incomplete underfill and broken solder bonds.

The "echo time" it takes for an ultrasound pulse to return to the receiver depends on the material and the distance traveled. By examining a particular echo time, analysts can nondestructively "slice" through the package at a particular depth. Each source frequency propagates differently in the package materials. Combining different frequencies and echo times gives failure analysts a three-dimensional image of the interior of the package.
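For a pulse that travels to an interface and back through a single material, the relationship between echo time and depth is depth = velocity × time / 2. The sketch below applies that relationship to a time gate. The sound velocity used for the molding compound is a placeholder assumption, since real values depend on the specific material, and the function names are illustrative.

```python
def echo_depth_um(echo_time_ns, velocity_m_per_s):
    """Depth of a reflecting interface, in micrometers.

    The pulse covers the distance twice (down and back), hence the factor 1/2.
    """
    round_trip_m = velocity_m_per_s * echo_time_ns * 1e-9
    return round_trip_m / 2.0 * 1e6


def gate_to_depth_window(gate_start_ns, gate_end_ns, velocity_m_per_s):
    """Convert a time gate into the depth slice it images."""
    return (echo_depth_um(gate_start_ns, velocity_m_per_s),
            echo_depth_um(gate_end_ns, velocity_m_per_s))


# Assumed velocity for a plastic molding compound (placeholder value).
MOLD_COMPOUND_VELOCITY = 3000.0  # m/s

shallow, deep = gate_to_depth_window(400.0, 500.0, MOLD_COMPOUND_VELOCITY)
print(f"gate images depths from {shallow:.0f} to {deep:.0f} micrometers")
```

A real package stacks several materials with different velocities, so a practical conversion would accumulate the travel time layer by layer rather than use a single velocity.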

Once the lab has garnered as much information as possible by non-destructive means, it may remove all or part of the package. Removing the package reduces the pressure exerted by the molding compound. Intermittent failures due to popcorn cracking or thermal expansion mismatch may go away.

Circuit navigation

Electrical test results refer to specific items on the circuit net list. CAD navigation software can help match the defective signal with a specific physical location. According to EGSoft's Scott Shen, it can also work in the opposite direction, correlating observed hot spots with particular signal lines. Like mask writing software, CAD navigation software must manipulate exponentially larger databases with each device generation. Only four points define a rectangle, but as many as twenty may be needed to define the serifs used in some optical proximity correction (OPC) schemes.
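The two directions of CAD navigation can be sketched as a pair of lookups. The example below is a toy model only: the layout dictionary, net names, layer names, and coordinates are all hypothetical, and a production tool would query a layout database with spatial indexing rather than scan a dictionary.

```python
from collections import namedtuple

# A rectangle on one layer; coordinates in micrometers. All data is hypothetical.
Shape = namedtuple("Shape", "layer x_min y_min x_max y_max")

LAYOUT = {
    "clk_buf_3": [Shape("metal2", 10, 40, 900, 42), Shape("via1", 10, 40, 12, 42)],
    "data_bus_7": [Shape("metal1", 100, 0, 102, 500)],
    "vdd_core": [Shape("metal3", 0, 30, 1000, 60)],
}


def locate_net(net_name, layout=LAYOUT):
    """Net list name -> physical shapes belonging to that net."""
    return layout.get(net_name, [])


def nets_at(x, y, layout=LAYOUT):
    """Physical location (such as an observed hot spot) -> candidate nets."""
    hits = []
    for net, shapes in layout.items():
        if any(s.x_min <= x <= s.x_max and s.y_min <= y <= s.y_max for s in shapes):
            hits.append(net)
    return sorted(hits)


print(locate_net("data_bus_7"))
print(nets_at(101, 41))
```

In this toy layout the point (101, 41) lies under shapes from all three nets on different layers, so the reverse lookup returns several candidates that the analyst must then separate by other means.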

CAD navigation becomes more important for complex designs with five or more metal layers. Examining the transistor and lower metal layers in such designs can require careful cross-sectioning from the backside of the chip, an impossible task without navigation software. Because of the equipment cost, these methods tend to be used by in-house failure analysis labs. Paquette said that independent labs tend to deal with less complex circuits and rely on less exotic analysis tools.

Navigation software helps, but still can't always give a precise indication of the failure site. A particular line might run the length of the chip. The failure lab will typically use several different sets of test vectors, combined with any special knowledge about the chip architecture or past failures, to try to locate the problem site as precisely as possible.
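One simple way to combine several sets of test vectors is to intersect the nets exercised by the failing vectors and then discard any net that a passing vector also exercised. The sketch below shows that idea under a single hard fault assumption; the vector names, net names, and coverage data are hypothetical, and the approach is much weaker for soft or intermittent failures.

```python
def suspect_nets(vector_coverage, results):
    """Narrow down suspect nets from pass/fail test vectors.

    vector_coverage: vector name -> set of nets that vector exercises
    results: vector name -> True if the chip passed that vector
    Single-fault assumption: the faulty net is exercised by every failing
    vector and by no passing vector.
    """
    failing = [vector_coverage[v] for v, ok in results.items() if not ok]
    passing = [vector_coverage[v] for v, ok in results.items() if ok]
    if not failing:
        return set()
    suspects = set.intersection(*failing)  # returns a new set, inputs untouched
    for nets in passing:
        suspects -= nets
    return suspects


# Hypothetical coverage data for three test vectors.
coverage = {
    "v1": {"n1", "n2", "n3"},
    "v2": {"n2", "n3", "n4"},
    "v3": {"n3", "n5"},
}
results = {"v1": False, "v2": False, "v3": True}

print(sorted(suspect_nets(coverage, results)))
```

Here the two failing vectors share nets n2 and n3, and the passing vector clears n3, leaving n2 as the remaining suspect. Knowledge of the chip architecture or of past failures would further rank whatever candidates survive.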

Once the package is open, Paquette said, a thin layer of liquid crystal applied to the chip surface can show hot spots. The liquid crystal transition temperature is low. Any circuit region with high resistance will heat up enough to cause a phase transition. Electron beam and mechanical probing can exercise suspect regions directly. Hot carrier injection, diode breakdown, and gate oxide leakage, among other defects, emit photons that light emission microscopy can detect.

Once the analyst has gleaned as much information as possible from these methods, cross-sectioning the part can further isolate the problem site or expose the source of the failure. Focused ion beam (FIB) milling is the tool of choice for this step. With FIB, the analyst can cut existing circuit lines and insert new lines and bond pads for mechanical or electron beam probing. If the problem appears to be due to a particular transistor or via chain, FIB facilitates probing of just those specific circuit elements. FIB in combination with CAD navigation software is essential for backside failure analysis of complex circuits.

FIB tools can examine suspected problem sites at electron beam resolution, including examination of contaminants to determine composition. If even this is not enough to identify the root cause of the problem, the transmission electron microscope (TEM) is the weapon of last resort. TEM requires extensive sample preparation, but can give atomic resolution images of interfaces and interfacial structures.

Such extreme resolution is becoming more necessary. For example, Bennett said, the 90-nm generation will require nitrided gate oxide only a few nanometers thick, with tolerances of only a few atomic layers. Both the fraction of nitrogen and its bonding state determine the properties of the dielectric. Some junctions require dopant concentrations above the solid solubility limit. Such junctions are metastable: dopants can actually precipitate out of solution under some conditions. Analysis of such atomic scale features would be impossible without atomic resolution imaging and surface analysis tools.

In an ideal world, failure analysis would be unnecessary. Defective parts would never reach customers. In the imperfect world of shrinking features and shorter market windows, it is an essential component of process improvement. Good failure analysis requires clear problem descriptions and close attention to cause and effect. It repays the time and effort it requires by relating process defects to specific product failures.