title: Event Simulation and Reconstruction ...
An essential aspect of particle physics experiments is the accurate simulation of the physics taking place during the experiment. This simulation can be roughly broken into two areas. The first is the simulation of the "hard scatter", or the energetic parton-parton interaction that seeds an event. The second is the simulation of how the particles resulting from the hard scatter propagate through the detector. The result of this simulation is a series of readings from different detector subsystems that can be collated and run through reconstruction algorithms to obtain a set of simulated events that can be compared with real data observed in the detector. Comparing simulation to observation is the key to drawing conclusions about the fundamental physics occurring in the experiment. In particular, disagreement between the two could be an indication of new physics, but could also point out mis-modeling of the detector or simply bugs in any part of this software chain. Therefore, rigorous examination and crosschecks must be applied to instill confidence in the accuracy of the simulation.
Event generator software is used to simulate the initial hard scatter of partons from the colliding protons. The generators used by LHC experiments employ sophisticated Monte Carlo (MC) techniques both to calculate cross sections for SM and certain BSM processes and to generate samples of events that can be compared with real data.
The strategy that CMS uses to reconstruct particles is referred to as "Particle Flow" (PF). PF attempts to identify every particle in the event. This may seem like an obvious thing to do, but the approach requires a highly granular detector to separate nearby particles. High granularity becomes particularly important when reconstructing jets from hadron decays. The calorimetry system alone can measure the jet energy, but combining it with tracker measurements of the momenta of the jet's charged components improves the precision dramatically. The key to PF reconstruction, then, is to link together hits in the different subdetectors to increase reconstruction effectiveness while avoiding double counting. The output of the PF algorithm is a collection of reconstructed particles and their momenta. This information can then be fed into higher-level algorithms to reconstruct jets (including jet flavor), hadronically decaying taus, and MET.
Before linking can be done, the raw hit data from the individual subdetectors must be processed into higher-level objects. For the tracker, this involves finding the sets of hits across layers that originated from the same particle. The generic algorithm for this is Combinatorial Track Finding (CTF)[]. The first step in CTF is finding pairs or triplets of hits in the pixel detector that are consistent with a track originating from the luminous region of CMS. This gives an initial estimate of the particle's charge and momentum, along with associated uncertainties. A Kalman Filter [], a statistical model augmented with physical rules, is then employed to extrapolate this "seed" through the rest of the tracker, matching hits as it goes. A helical best fit is performed on the resulting collection of hits. The quality of this fit, along with other quality metrics such as the number of missing hits, is used to decide whether the track is authentic or spurious. This procedure is performed in several iterations, starting with very stringent matching requirements to identify the particles that are "easy" to reconstruct (generally isolated, with high momentum). Hits used in earlier iterations are removed so that subsequent iterations, which have looser matching requirements, have fewer hit combinations to consider.
TODO: Talk more about particle-flow
Like the tracking algorithm described above, the reconstruction of electrons begins with the creation of seeds. These seeds consist of groups of hits in the pixel detector that are consistent with a track originating in the luminous region. Each seed is compared with energy deposits in the ECAL to find consistent matches. The ECAL object that is actually used is the Supercluster (SC), a set of nearby ECAL crystals whose energy distribution passes certain criteria to allow them to be grouped together. The procedure for matching seeds with SCs is as follows:
If a seed has enough matching hits it is paired with the SC, and they are passed together for full electron tracking. This involves the use of a modified Kalman Filter known as a Gaussian Sum Filter to propagate the initial track through the rest of the tracker. These "GSF electrons" continue on to several more filters and corrections before they are used in physics analyses.
Electron seeds created through this procedure are referred to as "ECAL-driven". There are also "tracker-driven" electron seeds which are not matched with ECAL SCs, and proceed through the GSF tracking algorithm based on tracker information alone. After the track has been reconstructed, it may then be matched with an ECAL SC. The tracker-driven seeds help to compensate for inefficiencies coming from the ECAL-driven matching procedure. In the end, the ECAL-driven and tracker-driven electron collections are merged to produce a single collection of electron candidates.
This general approach was developed during early CMS operation, but the installation of the Phase I pixel detector motivated a rewrite of the underlying algorithm to handle the additional layers. The new implementation was designed to be more flexible, allowing for a variable requirement on the number of matched hits, a significant change from the previous algorithm, which would only match two hits. Work was done to optimize this new hit-matching algorithm for the new pixel detector. In particular, the $\delta \phi$ and $\delta R/z$ windows were tuned to maximize efficiency while keeping the number of spurious electron seeds to a minimum.
The size of these matching windows is defined parametrically in terms of the transverse energy of the SC. The function for determining the cut is
$$ \delta(E_T)= \begin{cases}
E_T^{\mathrm{high}} ,& E_T > E_T^{\mathrm{thresh}} \\
E_T^{\mathrm{high}} + s(E_T - E_T^{\mathrm{thresh}}) ,& \mathrm{otherwise}
\end{cases} $$
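The piecewise definition above can be sketched in a few lines of Python. This is only an illustration: the function name `window_size` and its argument names are hypothetical and are not taken from the CMS software, and the example values are the hit-1 $\delta \phi$ parameters of the default window setting.

```python
def window_size(et, et_high, et_thresh, slope):
    """Matching-window size as a function of supercluster E_T.

    Hypothetical helper mirroring the piecewise formula in the text:
    constant (et_high) above et_thresh, linear in E_T with slope `slope`
    at or below it.
    """
    if et > et_thresh:
        return et_high
    return et_high + slope * (et - et_thresh)


# Hit-1 delta-phi parameters of the default window setting.
for et in (5.0, 10.0, 20.0, 50.0):
    size = window_size(et, et_high=0.05, et_thresh=20.0, slope=-0.002)
    print(f"E_T = {et:5.1f}  window = {size:.3f}")
```

With a negative slope, the window is widest at low $E_T$ and shrinks linearly until it reaches the constant value at the threshold.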
Normally, $s$ is negative, which means that the cut becomes tighter with increasing $E_T$ until $E_T^{\mathrm{thresh}}$, above which it is constant. The parameters to be optimized are therefore $E_T^{\mathrm{high}}$, $E_T^{\mathrm{thresh}}$, and $s$. The challenge in performing a rigorous optimization of these parameters is that they are defined individually for the first, second, and third+ matched hits, and also separately for $\phi$ and $R/z$, meaning that the optimization would have to be performed in many dimensions. In addition, the figure of merit for the optimization is difficult to define clearly, since a balance must be struck between efficiency and purity. Given these challenges, a more ad hoc approach was developed. The first step was to acquire two samples of simulated data: a $t\bar{t}$ sample with many jets containing photons and charged hadrons capable of faking electrons, as well as genuine electrons from W boson decays, and a $Z\rightarrow e^+e^-$ sample with many genuine electrons and relatively few electron-faking objects. The next step was to define efficiency and purity metrics. Because this is simulated data, it is possible to perform truth-matching to determine whether a reconstructed track originated from an electron. Efficiency is then defined as the proportion of simulated electrons that are successfully reconstructed as tracks, and purity as the proportion of tracks that resulted from a simulated electron.
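The two metrics can be made concrete with a small sketch. It assumes a simplified truth-matching record in which each reconstructed track carries either the index of the simulated electron it was matched to or `None`; this data layout and the function name are hypothetical. It also assumes that an electron reconstructed more than once counts only once toward the efficiency, which the text does not specify.

```python
def efficiency_and_purity(n_sim_electrons, track_truth_matches):
    """Compute (efficiency, purity) from truth-matching results.

    n_sim_electrons: number of simulated electrons in the sample.
    track_truth_matches: one entry per reconstructed track, holding the
        index of the matched simulated electron, or None if unmatched.
    """
    # Distinct simulated electrons with at least one reconstructed track
    # (assumption: duplicates count once).
    reconstructed = {m for m in track_truth_matches if m is not None}
    n_tracks = len(track_truth_matches)
    n_matched_tracks = sum(m is not None for m in track_truth_matches)

    efficiency = len(reconstructed) / n_sim_electrons if n_sim_electrons else float("nan")
    purity = n_matched_tracks / n_tracks if n_tracks else float("nan")
    return efficiency, purity


# Toy example: 4 simulated electrons, 5 tracks, 3 of them truth-matched
# (two of which point to the same electron).
eff, pur = efficiency_and_purity(4, [0, 1, None, 1, None])
print(f"efficiency = {eff:.2f}, purity = {pur:.2f}")
```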
However, before optimizing cuts or calculating efficiency and purity, it is necessary to understand what the hit residuals look like. Hit residuals are the $\delta \phi$ and $\delta R/z$ between hits and the projected track in the matching procedure. The simplest way to study the residuals is to run the hit-matching algorithm with very wide windows, which avoids imposing an artificial cutoff on the residual distributions. [@Fig:combined_residuals] shows the distributions of residuals of hits in the innermost barrel layer for truth-matched and non-truth-matched seeds. Also plotted are the contours to the left of which 90% and 99.5% of hits lie. Note the distinctly different distributions depending on whether the seed was truth-matched. Recall that for the first hit, the luminous region is used for one end of the projected track, and since the luminous region is significantly elongated in z, the residuals can be quite large. However, the $\delta \phi$ residuals for truth-matched tracks tend to be tiny (< 0.5°), while non-truth-matched residuals tend to be larger, especially at low $E_T$.
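Contours of this kind can be obtained by computing quantiles of the residuals in bins of $E_T$. The sketch below is a minimal illustration under stated assumptions: it uses the absolute value of the residual, a simple nearest-rank quantile, and hypothetical function names and bin edges; it is not the code used to produce the figure.

```python
import math
from collections import defaultdict


def nearest_rank_quantile(sorted_values, q):
    """Smallest value v such that at least a fraction q of entries are <= v."""
    index = max(0, math.ceil(q * len(sorted_values)) - 1)
    return sorted_values[index]


def residual_contours(hits, et_bin_edges, quantiles=(0.90, 0.995)):
    """Quantiles of |residual| in bins of supercluster E_T.

    hits: iterable of (et, residual) pairs.
    Returns {bin_index: {quantile: value}} for the non-empty bins.
    """
    binned = defaultdict(list)
    for et, residual in hits:
        for i in range(len(et_bin_edges) - 1):
            if et_bin_edges[i] <= et < et_bin_edges[i + 1]:
                binned[i].append(abs(residual))
                break  # each hit falls in at most one bin
    return {
        i: {q: nearest_rank_quantile(sorted(values), q) for q in quantiles}
        for i, values in sorted(binned.items())
    }


# Toy residuals; real inputs would come from the wide-window matching run.
toy_hits = [(10.0, -0.03), (12.0, 0.01), (15.0, 0.02), (18.0, -0.04), (30.0, 0.005)]
print(residual_contours(toy_hits, et_bin_edges=[0.0, 20.0, 40.0]))
```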
[@Fig:combined_residuals_L2] shows the residuals for the second matched hit for hits in the second barrel layer. After the first pixel hit has been matched, the projection of the track becomes much more precise: the residuals are roughly 20 times smaller for $\delta \phi$ and over 200 times smaller for $\delta z$.
Guided by these distributions, four sets of windows were designed with varying degrees of restriction. The parameters for these windows are listed in [@tbl:windows].
| | Parameter | extra-narrow | narrow (default) | wide | extra-wide |
|---|---|---|---|---|---|
| Hit 1 | $\delta \phi$ : $E_T^\mathrm{high}$ | **0.025** | **0.05** | **0.1** | **0.15** |
| | $\delta \phi$ : $E_T^\mathrm{thresh}$ | 20.0 | 20.0 | 20.0 | 20.0 |
| | $\delta \phi$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 |
| | $\delta R/z$ : $E_T^\mathrm{high}$ | 9999.0 | 9999.0 | 9999.0 | 9999.0 |
| | $\delta R/z$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 |
| | $\delta R/z$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 |
| Hit 2 | $\delta \phi$ : $E_T^\mathrm{high}$ | **0.0015** | **0.003** | **0.006** | **0.009** |
| | $\delta \phi$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 |
| | $\delta \phi$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 |
| | $\delta R/z$ : $E_T^\mathrm{high}$ | **0.025** | **0.05** | **0.1** | **0.15** |
| | $\delta R/z$ : $E_T^\mathrm{thresh}$ | 30.0 | 30.0 | 30.0 | 30.0 |
| | $\delta R/z$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 |
| Hit 3+ | $\delta \phi$ : $E_T^\mathrm{high}$ | **0.0015** | **0.003** | **0.006** | **0.009** |
| | $\delta \phi$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 |
| | $\delta \phi$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 |
| | $\delta R/z$ : $E_T^\mathrm{high}$ | **0.025** | **0.05** | **0.1** | **0.15** |
| | $\delta R/z$ : $E_T^\mathrm{thresh}$ | 30.0 | 30.0 | 30.0 | 30.0 |
| | $\delta R/z$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 |

: Parameters for pixel-matching windows. The narrow window is what is used in the HLT, and will also be referred to as default, while the extra-narrow, wide, and extra-wide windows are the HLT settings scaled by 0.5, 2, and 3, respectively. Bold numbers indicate parameters that are modified across window settings. {#tbl:windows}
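The relationship between the four window settings can be expressed compactly. The following sketch derives each setting from the default one by scaling only the $E_T^\mathrm{high}$ values, leaving $E_T^\mathrm{thresh}$, $s$, and the effectively unbounded value 9999.0 untouched, as in the table. The dictionary layout and all names are hypothetical and chosen only for illustration.

```python
UNBOUNDED = 9999.0  # value the table uses for an effectively disabled cut

# Default ("narrow") parameters, keyed by (hit, coordinate).
DEFAULT_WINDOWS = {
    ("hit1", "dphi"): {"et_high": 0.05, "et_thresh": 20.0, "slope": -0.002},
    ("hit1", "drz"): {"et_high": UNBOUNDED, "et_thresh": 0.0, "slope": 0.0},
    ("hit2", "dphi"): {"et_high": 0.003, "et_thresh": 0.0, "slope": 0.0},
    ("hit2", "drz"): {"et_high": 0.05, "et_thresh": 30.0, "slope": -0.002},
    ("hit3+", "dphi"): {"et_high": 0.003, "et_thresh": 0.0, "slope": 0.0},
    ("hit3+", "drz"): {"et_high": 0.05, "et_thresh": 30.0, "slope": -0.002},
}

SCALES = {"extra-narrow": 0.5, "narrow": 1.0, "wide": 2.0, "extra-wide": 3.0}


def scaled_windows(scale):
    """Return a copy of the default windows with et_high multiplied by `scale`."""
    result = {}
    for key, params in DEFAULT_WINDOWS.items():
        scaled = dict(params)
        if params["et_high"] != UNBOUNDED:  # leave the disabled cut as-is
            scaled["et_high"] = params["et_high"] * scale
        result[key] = scaled
    return result


for name, scale in SCALES.items():
    windows = scaled_windows(scale)
    print(name, round(windows[("hit1", "dphi")]["et_high"], 6))
```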
The pixel-matching algorithm was run on simulated data, and efficiency and purity were measured, as plotted in [@fig:gsf_roc]. Of the four working points, the extra-narrow working point was discarded for its drop in efficiency, and the extra-wide working point was dropped for being negligibly different from the wide working point. The performance of the remaining two working points in both $Z\rightarrow e^+e^-$ and $t\bar{t}$ is shown in [@fig:gsf_combined]. This figure shows efficiency and purity differentially in $p_T$, $\eta$, and $\phi$ of the GSF tracks that resulted from the pixel-matched seeds. The performance of the algorithm is demonstrated to be as good as or better than that of the original implementation.
Another feature of the new algorithm is that it tends to produce fewer seeds overall. However, it was initially implemented without the ability to skip hits in the matching procedure. Hit skipping refers to a feature of the matching algorithm such that if a hit fails to satisfy its prescribed matching criterion, it is skipped and that criterion is instead applied to the next hit in the seed. As long as a sufficient number of hits match, the seed is accepted. Without hit skipping, the algorithm is more selective about which seeds are matched with each SC, which can lead to a loss of efficiency. To avoid this inefficiency, a "hack" was introduced.
To understand the hack, one first has to understand how these tracker seeds are constructed. As mentioned previously, these seeds are normally created through several iterations, starting with quite stringent requirements. Hits that make it into these initial seeds are removed from consideration in subsequent iterations. By removing hits in each iteration, later iterations can have much less stringent requirements on matching hits across layers without creating an unnecessarily large number of seeds.

How is this related to hit skipping? Suppose that seeds are generated in three iterations: first, look for quadruplets across four BPIX layers; second, look for triplets across any combination of three of the four layers; finally, look for pairs across combinations of two of the four layers. Consider the case where the detector recorded one hit in each BPIX layer and all four hits can match with each other for the purposes of making tracker seeds. The top row of [@fig:hit_skipping] demonstrates how the seeding procedure would normally work. There would be a single quadruplet seed, and no pairs or triplets, since all four hits have been removed from further consideration. This quadruplet is then compared with some SC to make an electron seed. We require three matched hits for triplets and quadruplets, and just two for pairs. During the matching procedure, the hit in BPIX layer 3 (BPIX3) fails to match; however, it can be skipped, and hit 4 matches. Therefore, the tracker seed matches with the SC and proceeds through the rest of the reconstruction chain.

If hit skipping is disabled, however, one gets the situation in the middle row of [@fig:hit_skipping], where the unmatched hit in layer 3 cannot be skipped; as a result, only two hits are matched and no electron seed is produced. The "hack" to recover this seed is to disable the removal of hits between seed-construction iterations. As a result, many more seeds are created, as shown in the bottom row of [@fig:hit_skipping]. Among them, however, is a triplet of hits in the first, second, and fourth layers, which now matches, and the electron seed is recovered.
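The three scenarios described above can be reproduced with a toy model. The sketch below is a deliberate simplification: each hit is identified by its BPIX layer number, whether a hit passes the matching windows is a fixed yes/no per layer, and all function names are hypothetical. It is not the CMS implementation.

```python
from itertools import combinations

LAYERS = (1, 2, 3, 4)  # one hit per BPIX layer, identified by layer number


def build_seeds(remove_used_hits):
    """Build seeds in three iterations: quadruplets, triplets, then pairs."""
    available = set(LAYERS)
    seeds = []
    for size in (4, 3, 2):
        made = [c for c in combinations(LAYERS, size) if set(c) <= available]
        seeds.extend(made)
        if remove_used_hits:
            for seed in made:
                available -= set(seed)  # used hits are unavailable later
    return seeds


def seed_matches(seed, passing_layers, allow_skipping):
    """Require 3 matched hits for triplets/quadruplets and 2 for pairs."""
    required = 2 if len(seed) == 2 else 3
    n_matched = 0
    for layer in seed:
        if layer in passing_layers:
            n_matched += 1
        elif not allow_skipping:
            break  # a failed hit cannot be skipped, so matching stops here
    return n_matched >= required


passing = {1, 2, 4}  # the layer-3 hit fails the matching windows

scenarios = [
    ("hit removal, skipping allowed", True, True),
    ("hit removal, no skipping", True, False),
    ("no hit removal (hack), no skipping", False, False),
]
for label, remove, skip in scenarios:
    seeds = build_seeds(remove)
    matched = [s for s in seeds if seed_matches(s, passing, skip)]
    print(f"{label}: {len(seeds)} seeds built, matched: {matched}")
```

In this toy model the hack also lets several pairs through, since pairs need only two matched hits. This illustrates why disabling hit removal inflates the number of seeds.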
From a computational standpoint, the hacked situation is not ideal, because many more seeds must be built and matched. Therefore, work was done to add the hit-skipping feature to the new implementation. Adding hit skipping and removing the hack reduced the number of seeds in $t\bar{t}$ by 41% with the narrow matching windows and by 36% with the wide windows. Compared to the original pixel-matching implementation, there were also far fewer seeds, dropping from 12.6 on average to 2.5 for the narrow windows and 4.6 for the wide windows. All of this was achieved while maintaining comparable efficiency and purity.
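As a quick check on the size of this reduction relative to the original implementation, the fractional decrease in the average number of seeds follows directly from the averages quoted above:

$$ \frac{12.6 - 2.5}{12.6} \approx 0.80 , \qquad \frac{12.6 - 4.6}{12.6} \approx 0.63 $$

That is, roughly 80% fewer seeds with the narrow windows and roughly 63% fewer with the wide windows.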