title: Event Simulation and Reconstruction ...
An essential aspect of particle physics experiments is the accurate simulation of the physics taking place during the experiment. This simulation can be roughly broken into two areas. The first is generation of the event, performing simulation of the parton hard scatter as well as the production of accompanying prompt particles. The second is the simulation of how the particles resulting from the hard scatter propagate through the detector. The result of this simulation is a series of readings from different detector subsystems that can be collated and run through reconstruction algorithms to obtain a set of simulated events that can be compared with real data observed in the detector. Comparing simulation to observation is the key to drawing many conclusions about the fundamental physics occurring in the experiment. In particular, disagreement between the two could be an indication of new physics, but could also point out mis-modeling of the detector, the physics process, or simply bugs in any part of the software chain. Therefore, rigorous examination and crosschecks must be applied to instill confidence in the accuracy of the simulation.
Event generator software is used to simulate the initial hard-scatter of partons from the colliding protons. The software used by LHC experiments uses sophisticated Monte-Carlo (MC) techniques to both calculate cross-sections for SM and certain BSM processes, and to generate samples of events that can be compared with real data. The generation of events can be broken down into four distinct stages.
The strategy that CMS uses to reconstruct particles is referred to as "Particle Flow" (PF)[@CMS-PRF-14-001]. PF attempts to identify every particle in the event. This may seem like an obvious thing to do, but this approach requires a very highly granular detector to separate out nearby particles. High granularity becomes particularly important in the process of reconstructing jets from hadron decays. The calorimetry system can be used to measure the jet energy, but in combination with the tracker to measure the momentum of the charged components of the jet, the precision of measurement can be improved dramatically. The key, then, to PF reconstruction is to be able to link together hits in the different subdetectors to increase reconstruction effectiveness, while at the same time avoiding double counting. The output of the PF algorithm is a collection of reconstructed particles and their momentum. This information can then be fed into higher-level algorithms to reconstruct jets, including jet flavor, hadronically decaying taus, and MET.
Before linking can be done, the raw hit data from the individual subdetectors must be processed into higher level objects. For the tracker, the first step is to combine hits in individual pixels together into groups whose charges presumably originated from a single particle. This process is called clustering. The next step involves finding the sets of clusters across layers that that can be stringed together to form the track of a particle. The generic algorithm for this is Combinatorial Track Finding (CTF). The first step in CTF is finding pairs, triplets, or in the case of the Phase I upgrade, quadruplets, of clusters in the pixel detector that are consistent with a track originating from the luminous region of CMS. This gives an initial estimate of the particle's charge and momentum, along with associated uncertainties. A statistical model augmented with physical rules known as a Kalman Filter is employed to extrapolate this "seed" through the rest of the tracker, matching clusters as it goes. A helical best-fit is done on the final collection of hits. The quality of this fit, along with other quality metrics such as the number of layers with missing clusters, is used to decide if this track is to be kept for further processing. This procedure is performed in several iterations, starting with very stringent matching requirements to identify the "easy" to reconstruct particles (generally isolated, with high momentum). Hits from earlier iterations are removed so that subsequent iterations with looser matching requirements do not suffer from excessive combinatorics.
In the calorimeters, hits in individual ECAL crystals or HCAL towers are grouped to form Particle-Flow clusters. This clustering is performed by first finding local maxima in the energy deposition across the calorimeter. These maxima are called seeds. Cells adjacent to the seeds with energy above a certain threshold are then grouped with the seed. In a crowded event, it is possible for the clusters from different seeds to overlap. In this case, a specialized algorithm is employed to compute the sharing of energy between the seeds. An example event demonstrating this is shown in [@fig:cal_clusters].
The final step in the PF algorithm is to link together the building blocks from the four subdetectors: tracks from the tracker and muon system, and ECAL and HCAL clusters. Tracks from the tracker are matched with clusters by extrapolating the helical track to the depth in the calorimeter where the shower is expected to have started. If this position lies within a PF cluster, the track is linked with it. Bremsstrahlung photons from electrons are accounted for by looking for additional ECAL deposits that align with the tangent of the track at its intersection with each of the tracker layers. If such deposits exist, they are linked with the track also. Clusters in the ECAL and HCAL, as well as the PS and ECAL, can be linked if the ECAL(PS) cluster lies within the boundaries of an HCAL(ECAL) cluster in the $\eta-\phi$ plane. Finally, if a combined fit between a tracker track and a muon system track results in a sufficiently small $\chi^2$, the two tracks are linked together.
Like the tracking algorithm described above, the reconstruction of electrons begins with the creation of electron seeds. The construction of electron seeds consists of linking together tracker seeds with ECAL Superclusters (SC). These tracker seeds originate through the same method as the seeds used in PF tracking. They are collections of clusters across layers in the pixel detector that are consistent with a particle originating in the luminous region. A supercluster is a set of nearby ECAL crystals whose energy distribution passes certain criteria to allow them to be grouped together. The procedure for matching tracker seeds with SCs is as follows:
If a tracker seed has enough matching hits it is paired with the SC, and they are passed together for full electron tracking. This involves the use of a modified Kalman Filter known as a Gaussian Sum Filter to propagate the initial track through the rest of the tracker. These "GSF electrons" continue on to several more filters and corrections before they are used in physics analyses.
Electron seeds created through this procedure are referred to as "ECAL-driven". There are also "tracker-driven" electron seeds which are not matched with ECAL SCs, and proceed through the GSF tracking algorithm based on tracker information alone. After the track has been reconstructed, it may then be matched with an SC. The tracker-driven seeds help to compensate for inefficiencies coming from the ECAL-driven matching procedure. In the end, the ECAL-driven and tracker-driven electron collections are merged to produce a single collection of electron candidates.
This general approach was developed during early CMS operation, but the installment of the Phase I pixel detector motivated a re-write of the underlying algorithm to handle the additional layers. The new implementation was designed to be more flexible, allowing for a variable requirement on the number of matched hits, a significant change from the previous algorithm which would only match two hits. Work was done to optimize this new hit-matching algorithm for the new pixel detector. In particular, the $\delta \phi$ and $\delta R/z$ windows were tuned to maximize efficiency while also keeping the number of spurious electron seeds at a minimum.
The size of these matching windows is defined parametrically in terms of the transverse energy of the SC. The function for determining the cut is
$$ \delta(E_T)= \begin{cases}
E_T^{\mathrm{high}} ,& E_T > E_T^{\mathrm{thresh}} \\
E_T^{\mathrm{high} } + s(E_T - E_T^{\mathrm{thresh}}) ,& \mathrm{otherwise}
\end{cases} $$
Normally, $s$ is negative which means that the cut becomes tighter with higher $E_T$ until $E_T^{\mathrm{thresh}}$ where it becomes a constant. So the optimization was to $E_T^{\mathrm{high}}$, $E_T^{\mathrm{thresh}}$, and $s$. The challenge to performing a rigorous optimization of these parameters is that they are individually defined for the first, second, and third+ matched hits and also separately for $\phi$ and in $R/z$, meaning that the optimization would have to be performed in many dimensions. In addition, the figure of merit for the optimization is difficult to clearly define since a balance must be struck between efficiency and purity. Given these challenges, a more ad-hoc approach was developed. This consisted of first acquiring two samples of simulated data: A $t\bar{t}$ sample with many jets containing photons and charged hadrons capable of faking electrons, as well as genuine electrons from W boson decays, and a $Z\rightarrow e^+e^-$ sample with many genuine electrons and relatively few electron faking objects. Next, define efficiency and purity metrics. Because this is simulated data, it is possible to perform truth-matching to determine if a reconstructed track originated from an electron or not. Using this, efficiency is then defined as the proportion of simulated electrons that get successfully reconstructed as ECAL-driven GSF tracks, and purity is the proportion of ECAL-driven GSF tracks that resulted from a simulated electron.
However, before optimizing cuts or calculating efficiency and purity, it is necessary to understand what the hit residuals look like. Hit residuals are the $\delta \phi$ and $\delta R/z$ between hits and the projected track in the matching procedure. The simplest way to study the residuals is to run the hit-matching algorithm with very wide windows to avoid imposing an artificial cutoff on the distributions of the residuals. [@Fig:combined_residuals] shows the distributions of residuals of hits in the innermost barrel layer for truth-matched and non-truth-matched seeds. Also plotted are the contours where 90% or 99.5% of hits are to the left. Note the distinctly different distributions based on whether the seed was truth-matched or not. Recall that for the first hit, the luminous region is used for one end of the projected track, and since the luminous region is significantly elongated in z, the residuals can be quite large. However, the $\delta \phi$ residuals for truth-matched tracks tend to be tiny (<.5°), while non-truth-matched residuals tend to be larger, especially at low $E_T$.
[@Fig:combined_residuals_L2] shows the residuals for the second matched hit for hits in the second barrel layer. After matching the first pixel hit, the projection of the track becomes much more precise, roughly a factor of 20 times smaller for $\delta \phi$ and over 200 times smaller for $\delta z$.
Guided by these distributions, four sets of windows were designed with varying degrees of restriction. The parameters for these windows are listed in [@tbl:windows].
extra-narrow |
narrow (default) |
wide |
extra-wide |
||
---|---|---|---|---|---|
Hit 1 | $\delta \phi$ : $E_T^\mathrm{high}$ | 0.025 | 0.05 | 0.1 | 0.15 |
$\delta \phi$ : $E_T^\mathrm{thresh}$ | 20.0 | 20.0 | 20.0 | 20.0 | |
$\delta \phi$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 | |
$\delta R/z$ : $E_T^\mathrm{high}$ | 9999.0 | 9999.0 | 9999.0 | 9999.0 | |
$\delta R/z$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 | |
$\delta R/z$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 | |
Hit 2 | $\delta \phi$ : $E_T^\mathrm{high}$ | 0.0015 | 0.003 | 0.006 | 0.009 |
$\delta \phi$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 | |
$\delta \phi$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 | |
$\delta R/z$ : $E_T^\mathrm{high}$ | 0.025 | 0.05 | 0.1 | 0.15 | |
$\delta R/z$ : $E_T^\mathrm{thresh}$ | 30.0 | 30.0 | 30.0 | 30.0 | |
$\delta R/z$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 | |
Hit 3+ | $\delta \phi$ : $E_T^\mathrm{high}$ | 0.0015 | 0.003 | 0.006 | 0.009 |
$\delta \phi$ : $E_T^\mathrm{thresh}$ | 0.0 | 0.0 | 0.0 | 0.0 | |
$\delta \phi$ : $s$ | 0.0 | 0.0 | 0.0 | 0.0 | |
$\delta R/z$ : $E_T^\mathrm{high}$ | 0.025 | 0.05 | 0.1 | 0.15 | |
$\delta R/z$ : $E_T^\mathrm{thresh}$ | 30.0 | 30.0 | 30.0 | 30.0 | |
$\delta R/z$ : $s$ | -0.002 | -0.002 | -0.002 | -0.002 |
: Parameters for pixel-matching windows. The narrow
window is what is used in the HLT, and will also be referred to as default
, while the extra-narrow
, wide
, and extra-wide
windows are the HLT settings scaled by 0.5, 2, and 3, respectively. Bold numbers indicate parameters that are modified across window settings. {#tbl:windows}
The pixel-matching algorithm was run on simulated data and efficiency and purity were measured, as is plotted in [@fig:gsf_roc]. Of the four working points, extra-narrow
was discarded for its distinct drop in efficiency and the extra-wide
working point was dropped for being negligibly different from the wide
working point. The performance of the remaining two working points in both $Z\rightarrow e^+e^-$ and $t\bar{t}$ are shown in [@fig:gsf_combined]. This figures shows efficiency and purity differentially in $p_T$, $\eta$, and $\phi$ of the GSF tracks that resulted from the pixel-matched seeds. The performance of the algorithm is demonstrated to be as good or better than the original implementation.
Another feature of the new algorithm is that it tends to produce overall fewer seeds. In the old algorithm, only pairs of hits are matched. However, the additional layers of the Phase I pixel tracker created tracker seeds with up to four hits making matching only two hits easier and resulting in more electron seeds. The additional seeds created computational issues for HLT, which motivated the rewrite. [@Fig:n_seeds] shows how the number of seeds is reduced by the new algorithm.
An important feature of the initial implementation was that in the process of matching two hits in the tracker seed, it could skip hits. Hit skipping refers to a feature of the matching algorithm such that if a hit fails to satisfy its prescribed matching criteria, it is skipped and that criteria is instead applied to the next hit in the tracker seed. As long as a sufficient number of hits match, the tracker seed is matched with the SC. The rewrite did not include this feature, and lack of hit-skipping can lead to a loss of efficiency because it is more selective about what seeds get matched with each SC. To avoid this inefficiency, a "hack" was introduced.
To understand the hack, one first has to understand how these tracker seeds are constructed. As mentioned previously, these seeds are normally created through several iterations, starting with quite stringent requirements. Hits that make it into these initial seeds are removed from consideration in subsequent iterations. By removing hits in each iteration, later iterations can have much less stringent requirements on matching hits across layers without creating an unnecessarily large number of tracker seeds. How is this related to hit-skipping? Suppose that seeds are generated in three iterations: First, look for quadruplets across four BPIX layers. Second, look for triplets across any combinations of three of the four layers. Finally, look for pairs across combinations of two of the four layers. Let's consider the case where the detector recorded one hit in each BPIX layer and all four hits can match with each other for the purposes of making tracker seeds. The top row of [@fig:hit_skipping] demonstrates how the seeding procedure would normally work. There would be a single quadruplet seed, and no pairs or triplets since all four hits have been removed from consideration in subsequent iterations. This quadruplet is then compared with some SC to make an electron seed. We require that there be three matched hits for triplets and quadruplets, and just two for pairs. During the matching procedure, the hit in BPIX layer 3 (BPIX3
) fails to match, however it can be skipped and hit 4 matches. Therefore, the tracker seed matches with the SC and proceeds through the rest of the reconstruction chain. However, if hit-skipping is disabled one gets the situation in the middle row of [@fig:hit_skipping] where the unmatched hit in layer 3 cannot be skipped and as a result only two hits are matched and no electron seed gets produced. The "hack" to recover this seed is to disable the removal of hits between tracker seed construction iterations. As a result, many more seeds are created, as shown in the bottom row of [@fig:hit_skipping]. However, the seed (now a triplet) of hits in the first, second, and fourth layers now matches and the electron seed is recovered.
From a computational standpoint, the hacked situation is not ideal. Therefore, work was done to add the hit-skipping feature to the new implementation. Adding hit skipping to the new implementation and removing the hack reduced the number of seeds in $t\bar{t}$ by 41% with the narrow
matching window sizes and 36% for the wide
windows. Compared to the original pixel-matching implementation, there were also far fewer seeds, dropping from 12.6 on average to 2.5 for the narrow
windows and 4.6 for the wide
. This is all while maintaining comparable efficiency and purity performance.