I'm pretty sure there was an older one of these, but I couldn't find one apart from one back in 2013. So I'm going to sticky this, and if anyone wants to talk statistical analysis etc., here you go! To kick things off: I just saw an article from January stating that NFL's NextGen Stats have a new QB passing metric, the NGS Passing Score: https://www.aboutamazon.com/news/aws/nfl-unveils-new-stat-ranking-qb-passing-performance https://www.nfl.com/news/next-gen-stats-intro-to-passing-score-metric "We teamed up with the AWS Proserve data science group to develop a more comprehensive metric for evaluating passing performance. Built off of seven different AWS-powered machine-learning models, the NGS Passing Score seeks to assess a quarterback's execution on every pass attempt and transform that evaluation into a digestible score with a range between 50 and 99. The score can be aggregated on any sample of pass attempts while still maintaining validity in rank order (more on this later). Before we dive into the passing score formula, it is important to remember the why." Also, a few older articles on the challenge of evaluating QB performance, including one with a little input from Bob Griese: https://www.businessinsider.com/thinking-quarterback-bob-griese-2011-8 https://www.espn.com/espn/print?id=6299428&type=story https://www.theringer.com/2020/4/17/21224389/nfl-draft-quantifying-quarterback-evaluation

It's pretty technical and some of it goes over my head, but people who are interested should read through it. At the end it says this: "More scores to come. This is just the beginning. We will continue to iterate and apply our scoring methodology to other advanced stats and position groups. The NGS Passing Score is a smaller component of an even bigger score -- the Quarterback Score. But that requires components representing rushing performance, the ability to avoid pressure and sacks, and even the elusive analysis of determining the optimal target at every time stamp." For this Passing Score, some details:

"How the NGS Passing Score works: Instead of simply awarding all passing yards, touchdowns and interceptions to the quarterback, the NGS Passing Score equation leverages the outputs of our models to form the components that best...

- Evaluate passing performance relative to a league-average expectation.
- Isolate the factors that the quarterback can control.
- Represent the most indicative features of winning football games.
- Encompass passing performance in a single composite score (ranging from 50 to 99).
- Generate valid scores at any sample size of pass attempts.

Armed with a collection of powerful AI-driven tools, it is only fitting we put the pieces together to form the seven measurable components that make up our new passing score (listed in order of weight in the formula):

(I) Expected Points Added Over Expected (EPAOE) accounts for 46 percent of the passing score. EPAOE measures production relative to an expected value (using our new expected yards model) and is calculated as the difference between the actual value of a pass and the predicted value of the pass before the ball is thrown, when accounting for the probability of each pass outcome (e.g., completion, incompletion or interception).

(II) Expected Points Added (EPA) accounts for 18 percent of the passing score. Instead of quantifying the success of a play in terms of yards gained, EPA represents success in terms of points added relative to the current play.

(III) Completion Percentage Over Expected (CPOE) accounts for 11 percent of the passing score. CPOE is a derivative of completion probability, which measures the success of a pass relative to the difficulty of the throw. The CPOE feature used in the score does adjust for dropped passes.

(IV) Interception Probability (INT Probability) accounts for 11 percent of the passing score. INT Probability measures the likelihood that a pass will be intercepted if thrown.

(V) Air Expected Points Added (Air EPA) accounts for 7 percent of the passing score. Air EPA is equal to the value of a completion plus the yards a receiver would be expected to gain after the catch. Air EPA is a proxy for the optimal reward of a pass within the control of the quarterback.

(VI) Expected Air EPA (xAir EPA) accounts for 7 percent of the passing score. xAir EPA is equal to the value of a completion (plus expected YAC), relative to the likelihood of a completion (e.g., completion probability).

(VII) Win Probability (WP) is not a feature in the model, but it is used as an aggregation play-weight. On any given play, the offense's pre-snap win probability is used as a weight in the passing score formula, where closer to 50 percent win probability equals one and closer to 10 percent or 90 percent equals 0.6.

Each component is converted to a standardized z-score based on the population of all pass attempts from the 2018 through 2021 seasons (n = 70,439). To prevent any one component from dominating the score, each individual z-score was selectively clipped at 3 or 4 standard deviations below and above the mean. The linear combination of components makes up an individual play score that ranges from 50 to 100.
Now onto the aggregation step: take the average of individual play scores weighted by the offense's pre-play win probability, using a parabolic shape centered at 0.5. In other words, plays in close games get greater weight in the formula than plays at the extremes. Plays with less than 10 percent win probability or greater than 90 percent win probability are worth roughly 60 percent less in the aggregation formula than a play in a game at 50-50 odds. But that's not all. To account for small-sample bias, the passing score aggregation formula uses the James-Stein estimator to "shrink" predictions closer to the population average. This Bayesian approach became a popular technique in recent years for solving small-sample issues when predicting batting average in baseball. Because the NGS Passing Score leverages this solution, we will have the ability to evaluate quarterback play at the season, game and situational level, while still maintaining a consistent distribution shape of scores."
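To make the quoted pipeline concrete (clipped z-scores, a weighted play score, WP-weighted aggregation, then shrinkage), here's a toy sketch in Python. The component weights and the 0.6 play weight at the extremes come from the article; the clipping bound of 3, the 50-100 scaling constants, the shrinkage pseudo-count, and the exact parabola coefficient are my own placeholder assumptions, and the shrinkage is a simplified empirical-Bayes form rather than the exact James-Stein estimator.

```python
import numpy as np

# Component weights quoted in the article (WP is a play weight, not a component).
WEIGHTS = {"epaoe": 0.46, "epa": 0.18, "cpoe": 0.11,
           "int_prob": 0.11, "air_epa": 0.07, "xair_epa": 0.07}

def play_score(z):
    """z: dict of component z-scores for one pass attempt."""
    total = 0.0
    for name, w in WEIGHTS.items():
        total += w * np.clip(z[name], -3.0, 3.0)  # clipping bound is an assumption
    # Map the weighted z-sum onto roughly a 50-100 scale (scaling is an assumption).
    return 75.0 + 10.0 * total

def wp_weight(wp):
    """Parabolic play weight centered at 0.5: 1.0 at 50-50 odds,
    0.6 at 10% or 90% win probability (per the article's component VII)."""
    return 1.0 - 2.5 * (wp - 0.5) ** 2  # 2.5 chosen so wp_weight(0.1) == 0.6

def aggregate(scores, wps, prior=75.0, shrink_n=50):
    """WP-weighted mean of play scores, then shrinkage toward the prior.
    shrink_n is an assumed pseudo-count controlling shrinkage strength."""
    w = np.array([wp_weight(p) for p in wps])
    raw = np.average(scores, weights=w)
    n = len(scores)
    b = n / (n + shrink_n)  # more attempts -> less shrinkage toward the prior
    return b * raw + (1 - b) * prior
```

With a small sample of good plays, `aggregate` pulls the result back toward the league-average prior, which is the small-sample behavior the article describes.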

Note they don't tell you what the correlation to win% is. That's the first thing to ask, because if it's not much higher than EPA's, what's the point? Of course, if it increased correlation to win% by a huge amount over EPA, that would be big news, so I doubt they succeeded there: it would be the first thing they'd tell you. Anyway, most of the metric is EPA or some derivative of it, so it doesn't look like it's adding much. At least you can see what they're doing, though, which is unlike ESPN's QBR or DVOA, where they keep it a black box because they don't want to tell you all their subjective assumptions. Here they're just trying to find optimal weights on known components. Anyway, NGS has introduced some very useful stats like CPOE, so overall I think they do a good job, but this doesn't seem to add much.

Suppose a measure like this is less strongly correlated with winning: how would you go about determining whether it's nonetheless more valid than more strongly correlated measures at capturing QB play independent of other areas of the game?

You'd have to categorize stats by how much they are team stats versus QB stats, and then compare correlation to win% within each category. For example, I'd categorize stats that reference scoring separately from those that don't. EPA literally is expected points scored, and passer rating includes TD%, so those stats are in some sense "cheating" by looking at the final outcome. Y/A and sack% do not, so seeing how far you can maximize correlation to win% using only stats within each category would be interesting. EPA and passer rating can also be put in different categories because one includes the running game while the other doesn't. Otherwise, within each category you can compare models using standard model-selection methods, like the Akaike Information Criterion, which balances complexity against goodness of fit. The more complex the model, the better it fits the data at hand, but at some point the added complexity buys in-sample fit without improving prediction, and the AIC is one of several approaches that quantifies this balance, i.e., you want the simplest model with the greatest predictive power: https://en.wikipedia.org/wiki/Akaike_information_criterion
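As an illustration of the complexity-vs-fit tradeoff, here's a minimal AIC comparison on synthetic data (not football data), using the Gaussian least-squares form AIC = n*ln(RSS/n) + 2k. The degree-5 polynomial fits the sample a bit better, but the AIC penalizes its extra parameters:

```python
import numpy as np

def aic_gaussian(y, y_hat, k):
    """AIC for a least-squares model with k fitted parameters,
    assuming Gaussian errors: AIC = n*ln(RSS/n) + 2k."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, x.size)  # the true relationship is linear

for degree in (1, 5):
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    k = degree + 1  # number of polynomial coefficients
    print(f"degree {degree}: AIC = {aic_gaussian(y, y_hat, k):.2f}")
```

Lower AIC is better; on data like this the simpler model usually wins despite its slightly larger residuals, which is exactly the "simplest model with greatest predictive power" idea.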

In McDaniel's interview with Travis today, I thought it was smart to ask about statistical analysis and how it plays into decision making. McDaniel's answer was that it's one of many tools, and it can't dictate what you do in certain situations. For instance, you'd never do something 100% of the time just because the stats tell you to; you'd have to look at the opponent and their tendencies, the game-time situation, what you've learned from experience, the talent you have available, etc. I thought it was a very smart answer.

It does, I believe. I didn't paste the whole article. Here's a piece from the relevant section, I think: "A passing stat that correlates with wins So how well does the NGS Passing Score correlate with winning football games? We took a look at 202 individual seasons from 88 different quarterbacks over the last four years, grouped each season into buckets of five (95-plus, 90 to 95, 85 to 90, etc.), and compared the win-loss record and percentage of playoff berths across each bucket. The relationship between a player's single-season NGS Passing Score and winning percentage is quite strong. A score around 85 serves as an indicator of a winning percentage near the .500 mark. A score above 85, and your team is more than likely winning with, rather than in spite of, their quarterback. A score of 90-plus -- those are the players you win because of. We group single-season scores into five-point buckets at the single-season level, with clear thresholds for quality of play. The distribution of passing score points to 80 as a rough Mendoza Line for starting-level QB performance. Quarterbacks falling below that line are often young players acclimating to the league, or replacement-level talent that teams will look to upgrade from in the following season. Quarterbacks falling between a score of 80 and 90 are a mix of guys you can win in spite of and win with. Finally, passing scores above 90 can be roughly considered elite, the players you win because of. Over the last four seasons, there have been 10 quarterbacks to finish the season with a score of 95-plus. Only two (Deshaun Watson in 2020 and Derek Carr in 2019) played for teams that failed to make the playoffs during the season in which they reached that mark. The tradeoff between correlation and stability..." Anyone interested in this should read the whole thing.

What you posted doesn't answer the question: grouping into buckets throws away information needed to assess the relationship. However, you're right that they address this later. I'm not sure exactly which QBs they're including or excluding, but the correlation to win% for passer rating is 0.67. That number goes down if you selectively include only certain QBs, which NGS might be doing, but on its face this isn't an improvement over even simple passer rating. Also, note they're not comparing to EPA, which is what we really want to know. Comparing to PFF, which is subjective and not a stat, and to QBR, which is widely panned because it includes unknown subjective assumptions, is selectively comparing to precisely the things we don't need to compare the metric to. Not very forthcoming.
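The point about buckets is easy to demonstrate on synthetic data (these are made-up numbers, not NGS data): if two variables correlate at about 0.5 at the individual level, correlating the averages of sorted buckets washes out the individual noise and produces a much higher, misleading correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(0, 1, n)  # stand-in for a per-season passing metric
# Construct y so the individual-level correlation is ~0.5.
y = 0.5 * x + rng.normal(0, np.sqrt(1 - 0.25), n)

raw_r = np.corrcoef(x, y)[0, 1]

# Sort by x, split into 10 equal-count buckets, and correlate the bucket means,
# mimicking the article's bucketed presentation.
order = np.argsort(x)
bx = x[order].reshape(10, -1).mean(axis=1)
by = y[order].reshape(10, -1).mean(axis=1)
bucket_r = np.corrcoef(bx, by)[0, 1]

print(f"individual-level r = {raw_r:.2f}, bucket-mean r = {bucket_r:.2f}")
```

The bucketed correlation comes out near 1 even though the underlying relationship is only moderate, which is why bucketed win-rate tables can't substitute for the raw correlation.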

The whole thing is a mess when the year-to-year correlation of the best statistic we know of for purportedly measuring QB play independent of other areas of the game (the PFF grade) is a mere 0.52. Anything purporting to measure QB ability as a trait should do far better than that, in my opinion, assuming we're talking about seasons after a QB's performance has leveled off.

Actually, that's a good point, because the year-to-year correlation for a bunch of passing stats sits right around that 0.5 level. That's true for EPA per pass play and also for passer rating. They're saying the year-to-year correlation for the NGS score is lower, at 0.42, and I agree that's really low. A 0.5 correlation means only 25% of the variance in QB play, as measured by that stat in any given year, is explained by the previous year's value. Sure, the environment changes, but you'd think a "true" measure of QB ability would at least explain 50% of the variation in play the next year, i.e., a year-to-year correlation of about 0.7.
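The arithmetic behind those figures is just r-squared (the share of variance a linear relationship explains), spelled out here as a quick check:

```python
# Variance explained by a linear relationship is the square of the correlation.
r = 0.5
print(r ** 2)  # a 0.5 year-to-year correlation explains 25% of the variance

# The NGS score's reported 0.42 explains even less:
print(round(0.42 ** 2, 3))  # under 18%

# Correlation needed to explain half the variance:
target = 0.5 ** 0.5
print(round(target, 3))  # ~0.707, the "about 0.7" figure above
```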

That's a more interesting question than some might realize. In principle you always go for prediction; that's how theories are tested in science: predict something and see whether your model worked. The problem is overfitting. Take a very simple case where the physics or the biology suggests modeling something as an exponential decay process, but you don't know the parameter values (e.g., you don't know the rate of decay). If you estimate the parameters from a small sample of data, you'll likely end up with a model that fits the data you have well but predicts poorly, because it overfits. In some cases, using nothing more than the mean of the data points you have gives better prediction than a more biologically plausible model whose parameters were estimated from your sample; in that case the simple correlation would be superior. The question is therefore somewhat tricky: you have to balance model complexity against both predictive power and overfitting if you're going to apply the model to small samples. As sample size increases, the difference between using correlation and prediction goes away.
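A small simulation along the lines of the decay example (all the constants here, including the decay rate, noise level, and sample sizes, are arbitrary choices for illustration): fit the "right" exponential model on just five noisy points via a log-linear regression, then compare its out-of-sample error against the naive training mean. Depending on the noise and the draw, the mechanistically correct model can lose to the plain mean.

```python
import numpy as np

rng = np.random.default_rng(7)

def make_data(n, k=0.3, noise=0.15):
    """Noisy samples from an exponential decay y = exp(-k*t)."""
    t = rng.uniform(0, 10, n)
    y = np.exp(-k * t) + rng.normal(0, noise, n)
    return t, np.clip(y, 1e-3, None)  # keep y positive for the log fit

t_tr, y_tr = make_data(5)    # small training sample
t_te, y_te = make_data(200)  # larger test sample

# "Mechanistic" model: estimate the decay rate by a log-linear least-squares fit.
slope, intercept = np.polyfit(t_tr, np.log(y_tr), 1)
pred_exp = np.exp(intercept + slope * t_te)

# Naive baseline: predict the training mean everywhere.
pred_mean = np.full_like(y_te, y_tr.mean())

mse_exp = float(np.mean((y_te - pred_exp) ** 2))
mse_mean = float(np.mean((y_te - pred_mean) ** 2))
print(f"exponential-fit MSE: {mse_exp:.3f}, mean-baseline MSE: {mse_mean:.3f}")
```

Rerunning with different seeds or noise levels shows the tradeoff: with enough data the exponential fit wins reliably, but at n = 5 its parameter estimates are noisy enough that the dumb baseline is sometimes better.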

Another problem with overfitting arises when it's combined with culling data. A researcher may exclude certain data for reasons such as: being an outlier; being old data; suspect methodology; being non-representative of the target demographic; differences in measurement standards; and so on. Researchers almost never exclude data that fits their hypothesis, so even when they apply their standards in good faith, they can end up creating a data set that is unrepresentative of reality. In other cases they're deliberately putting a thumb on the scale to get the result they want.

The other problem I've run into with overfitting is that models work best with an optimal number of variables, and generally speaking the most robust predictive models have relatively few moving parts. E.g., the NFL passer rating uses yards/attempt, completion%, TD% and INT%, and has remarkably strong predictive power, although it needs to be standardized to a common base (or converted to z-scores) to compare different years; it's a stronger predictor than some of the black-box methods, such as DVOA, that use thousands of lines of code. When you're dealing with a small data set, you can overfit to what's true for that particular sample but isn't necessarily true for the larger group. The other issue is that some variables are dependent and others independent, and if you don't identify which are which, and how they interact, you can have problems with overfitting.
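The four-component passer rating mentioned above really is just a few moving parts; the standard NFL formula clamps each component to [0, 2.375] and scales the sum, which a few lines of Python can show:

```python
def passer_rating(att, comp, yards, td, ints):
    """Traditional NFL passer rating: four components, each clamped to [0, 2.375],
    summed, divided by 6, and scaled by 100 (maximum 158.3)."""
    clamp = lambda v: max(0.0, min(v, 2.375))
    a = clamp((comp / att - 0.3) * 5)     # completion percentage component
    b = clamp((yards / att - 3) * 0.25)   # yards-per-attempt component
    c = clamp((td / att) * 20)            # touchdown percentage component
    d = clamp(2.375 - (ints / att) * 25)  # interception percentage component
    return (a + b + c + d) / 6 * 100

# A line that maxes every component hits the "perfect" 158.3:
# 20 att, 16 comp, 250 yards, 3 TD, 0 INT.
print(round(passer_rating(20, 16, 250, 3, 0), 1))  # 158.3
```

The clamping is also why the stat needs re-baselining across eras: in a league where averages drift, a fixed formula shifts everyone's scores together, so z-scores within a season are the cleaner comparison.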

As a Quality professional, the best thing I was ever taught is that "statistics don't lie, but statisticians do." It's just SPC, Statistical Process Control. My issue with SPC is that it's meant to be applied to a process, to machines, for fine-tuning, NOT to a person who has full control over his own functioning. I like McDaniel's approach to this reinventing of the wheel: call it analytics or SPC, it's the same thing. Maybe it has its place in football, but any self-respecting head coach might use it as a tool while otherwise paying little attention to it.