A good suggestion was made by Irishman to create a dedicated thread for the discussion of statistical methodology in football. The idea is to move as much of the discussion about statistical methodology as possible into one thread so that it doesn't clutter up other threads. Results of statistical analyses can still be presented in other threads, but discussion about the methodology itself is probably better off in its own thread.
I'll start off this thread by giving a simple intro to four commonly discussed methods:
1) z-scores
2) correlations
3) confidence intervals
4) hypothesis tests
z-scores
The purpose of z-scores is to take different sets of measurements and put them on the same scale. A common application of z-scores in football is to compare stats across eras.
I'm not sure how best to explain z-scores (for me it would be best to just show the math.. one line and we're done lol), but one way to think of z-scores is to ask how you would convert between two different scales like Celsius and Fahrenheit. First you need to know where the origin of the scale lies (where zero is): zero degrees Celsius corresponds to 32 degrees Fahrenheit. Then you need to know how the units of measurement compare: every 5-degree increase in Celsius corresponds to a 9-degree increase in Fahrenheit.
Same thing with z-scores. First you need to know what the origin corresponds to: a z-score of zero always corresponds to league average. Then you need to know the “unit of measurement”, which in this case is the standard deviation (a measure of how far the values go from league average). For those interested in how standard deviation is calculated:
https://en.wikipedia.org/wiki/Standard_deviation
Here's an example of how to convert a stat like passer rating to z-scores. In 1970 the league average for team passer ratings was 66, so a 66 rating in 1970 automatically corresponds to a z-score of 0. The standard deviation was 14.34, so one z-score unit in 1970 equals 14.34 passer rating points. If a QB had a rating of 85 in 1970, that's (85 – 66)/14.34 = 1.325 z-scores above the mean. If the rating is below the mean you just get a negative z-score.
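That arithmetic really is a one-liner. Here it is as a minimal Python sketch using the 1970 figures above:

```python
def z_score(value, league_mean, league_sd):
    # Distance from league average, measured in standard deviation units
    return (value - league_mean) / league_sd

# 1970 team passer rating: league mean 66, standard deviation 14.34
z = z_score(85, 66, 14.34)
print(round(z, 3))  # → 1.325
```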
How to interpret z-scores? They tell you how “impressive” something was relative to league average, regardless of era or even type of measurement (you actually could compare z-scores of passing stats to z-scores of rushing stats). However, z-scores do not tell you whether an offense with a z-score of 1.325 in 1970 would perform with the same z-score in 2019. There’s no implication about transplanting someone from the past to the present or vice versa, just a measure of how “impressive” something is regardless of era.
One final thing: you can use z-scores to report an "adjusted" rating in some target year, like a 1970 rating in 2019 numbers. Just calculate the z-score, then translate it to the 2019 scale, no different than going from Celsius to Fahrenheit.
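As a sketch of that translation in Python, using the 2019 league figures quoted later in this thread (mean 90.77, SD 10.3):

```python
def to_target_scale(value, src_mean, src_sd, tgt_mean, tgt_sd):
    # Convert to a z-score on the source scale, then map onto the target scale,
    # exactly like converting Celsius to Fahrenheit
    z = (value - src_mean) / src_sd
    return z * tgt_sd + tgt_mean

# An 85 passer rating in 1970 (mean 66, SD 14.34) expressed in 2019 numbers
print(round(to_target_scale(85, 66, 14.34, 90.77, 10.3), 1))  # → 104.4
```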
Correlation
A correlation is a measure of how two sets of stats are related to each other. Without getting too technical, I'd interpret it as a measure of how well you can predict the value of one stat from the other. A correlation of zero means you have no ability to predict beyond random guessing. The maximum possible correlation of 1, or the minimum possible correlation of -1, means that by knowing one stat you know exactly what happens to the other. The only difference between 1 and -1: a correlation of 1 means an increase in one stat implies an increase in the other, while a correlation of -1 means an increase in one stat implies a decrease in the other.
Thus, correlations range from -1 to 1 and the closer the number is to either -1 or 1 the better you can predict one stat from the other. Two things to remember with correlations: 1) there is no implication of causality, and 2) there is no implied order in the two stats being compared, meaning that the predictive relationship is the same regardless of which stat you use to try to make the prediction.
One more thing about correlations: the degree to which you can predict one stat from the other is actually not the correlation itself but the square of the correlation, which is called r-squared or r^2. So if someone reports a correlation of 0.5 it’s really the square of that number that is meaningful: 0.5^2 = 0.25 because that tells you the percent of variation in one stat you can explain by looking at the other stat (in this case 25% of the variation in one stat is known by looking at the other stat).
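Here's a minimal standard-library Python sketch of a Pearson correlation and its r^2, on made-up data (the numbers are purely illustrative):

```python
import statistics

def pearson_r(xs, ys):
    # Covariance of the two stats divided by the product of their spreads
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired stats
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 3), round(r ** 2, 2))  # → 0.775 0.6, i.e. 60% of variance explained
```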
Confidence intervals
With almost everything in statistics there is something called a confidence interval, or CI, associated with it, and usually it's a 95% CI. Loosely speaking, a 95% CI specifies the range within which the true value of that statistic lies with 95% probability (more precisely, the procedure that produces the interval captures the true value 95% of the time). The important thing about a CI is that it depends on sample size, and it's how you see the effect of sample size on a stat.
To be clear, almost never do you see 95% CI reported in commonly available football stats. That’s because the standards are low. They really should report 95% CI with every stat so you can see how uncertain the estimates are.
To give you an idea of what 95% CI looks like for Tom Brady (only QB I’ve calculated it for), after 1 game played in a season the 95% CI for Brady’s passer rating spans almost 140 passer rating points (70 above and 70 below whatever rating he got in that first game!), after 2 games it goes down to about 40 above and 40 below his passer rating after 2 games, and that 95% CI keeps decreasing as more games are played. In other words, as sample size increases the range within which the “true” passer rating lies keeps shrinking and you can have more confidence that the stat reflects "true" ability.
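The shrinking is no accident: CI width scales roughly with one over the square root of sample size. Here's a toy sketch of that shape in Python; the ±70 starting point is the figure above, but real game data won't follow this idealized curve exactly:

```python
import math

def ci_half_width(n_games, base=70.0):
    # Rough sketch: a 95% CI half-width shrinks like 1/sqrt(sample size)
    return base / math.sqrt(n_games)

for n in (1, 2, 4, 16):
    print(n, round(ci_half_width(n), 1))  # 70.0, 49.5, 35.0, 17.5
```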
CI is how sample size affects every single statistic, and I put it here not only because it’s important, but also because it makes explaining the next topic easy.
Hypothesis testing
You often hear that something is "statistically significant". All that means is that it is too unlikely to have occurred by chance alone. Once you know the 95% CI (see previous section), all you have to do is ask whether the statistic you observed lies within the 95% CI or not. If yes, it is still plausibly the result of random variation alone. If no, it's "statistically significant" and considered too unlikely to have occurred by chance alone.
The choice of using a 95% CI rather than a 99% CI is arbitrary, but it’s the standard in almost every area of science. 95% CI corresponds to a 1 in 20 chance of the event occurring by random variation alone and that's generally unlikely enough in most contexts to say it's "statistically significant". I’ll note however that there are other contexts where the threshold is way higher. Best example is particle physics where the threshold might be at 1 in 3.5 million (5 standard deviations) before it’s statistically significant lol.
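Those thresholds are easy to check against the normal distribution. A quick sketch (`math.erfc` gives the two-sided tail probability):

```python
import math

def two_sided_p(z):
    # Chance of landing at least z standard deviations from the mean (either side)
    return math.erfc(z / math.sqrt(2))

print(round(two_sided_p(1.96), 3))      # ≈ 0.05, i.e. the 1-in-20 threshold
print(round(1 / (two_sided_p(5) / 2)))  # one-sided 5-sigma: ≈ 1 in 3.5 million
```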
OK.. maybe that will get things started.
Page 1 of 3
-
Great stuff cbrad, thank you.
I'll just add that with regard to the topic of z-scores, a z-score is also a measure of how rare (or common) something is.
Take the following distribution for example -- this is what's known as the normal distribution:
In this particular example of a normal distribution, the x-axis (horizontal, along the bottom) consists of IQ scores. The y-axis (vertical, along the left side) consists of the frequency of those IQ scores in the population. So the further you go either right or left along the x-axis, the rarer the IQ score; the closer to the middle of the x-axis, the more common it is. IQ scores of 145 or more (or 55 or less), which correspond to z-scores of 3.0 and above (or -3.0 and below), are exceedingly rare, whereas IQ scores between, say, 95 and 105 are far more common.
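Those rarity figures can be checked directly against the normal distribution (assuming IQ has mean 100 and SD 15, as in the graph):

```python
import math

def fraction_within(z):
    # Fraction of a normal population within ±z standard deviations of the mean
    return math.erf(z / math.sqrt(2))

print(round(1 - fraction_within(3), 4))  # beyond ±3 SD (IQ under 55 or over 145): ≈ 0.0027
print(round(fraction_within(1 / 3), 2))  # within ±1/3 SD (IQ 95 to 105): ≈ 0.26
```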
How does this translate to football?
Passer ratings for example are likely distributed normally as in the graph above, where very low and very high passer ratings are rare, and average passer ratings are much more common. If we use career passer ratings to represent quarterbacks' quality, then there have been far more average quarterbacks in the league than there have been extremely good or extremely bad ones.
So with a z-score what we have is a measure of how much something deviates from the norm. An IQ of 145 deviates considerably from the norm, as does a season passer rating of 117.5, a la Ryan Tannehill in 2019. The average passer rating in the league in 2019 was 90.77 with a standard deviation of 10.3, and Tannehill's passer rating of 117.5 was 2.59 z-scores (or standard deviations) above league average. That means it deviated considerably from the league norm.
The next time you go to your doctor and he orders labs for you, if you get a printout of your labs you'll see that there is a normal range for every value. Your white blood cell count for example has a normal range of 4,500 to 11,000 white blood cells per microliter of blood. If you find yourself above or below that normal range, the next question for you should be "how far," and we could answer that with a z-score, which tells us how much your white blood cell count deviates from the norm. -
-
I hope as the off-season doldrums roll on, some of our posters will be able to review this information without bringing in their perceptions (biases) and getting "wound around the axle" with word definitions.
When this information is examined prior to trying to use it, I suspect it will reduce the number of posters feeling they are being attacked for an opinion, when all that was presented was a piece of information about a statistic and its use.
It will be interesting to see what questions come up as a result of this primer on statistics. -
Some points regarding passer rating.
1) It was developed before the NFL started tracking sacks as a statistic. The NFL wants to compare QBs across all of its history, so it will not incorporate sacks into the official method.
2) Passer rating would be improved if sacks were included as negative pass plays. However, because sack rates are roughly normally distributed, in the majority of large-sample situations including sacks would not dramatically alter the outcome.
3) As time has progressed the NFL has gotten better at passing the football. Average passer rating has been increasing. To adjust historical records to allow comparison with current ratings use the following formula:
[target year rating] x [current year average/target year average]
4) The NFL made major changes to the passing rules starting in the 1978 season. Data from 1978 onwards has a lower standard deviation and is more consistent year to year. If you are comparing current data to historical data it is best to use 1978 onwards as your historical data set. -
Adjusting ratings using z-scores is in almost every case the correct approach (I'll list exceptions at the end of this post). IF the standard deviations across years are similar, then adjusting by z-scores is accurately approximated by the formula in #3. But you cannot use that simple method of adjustment when the standard deviations change, as they did drastically from pre-1978 to post-1978 for passer rating. So if you want to compare passer ratings from, say, 1970 to 2010 you HAVE to use z-scores. If you only care about passer ratings post-1978 then the formula in #3 is fine.
Also, that 1978 boundary is ONLY for passer rating (and also a few other passing stats). It's not a boundary for all stats.
So when should you not use z-scores? When the distribution of the stat is highly skewed. The problem with z-scores is that they assume symmetry in the distribution of the stat, which is usually correct for football stats with large sample sizes. But when you take ratios like TD:INT ratio that symmetry disappears, and it's better to adjust stats differently. How to do that is on a case-by-case basis, but in some cases the formula in #3 works if you replace "average" with "median", which is insensitive to skew. Last edited: Feb 4, 2020 -
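To see the two adjustment methods side by side, here's a sketch with hypothetical league means and standard deviations (not real-year figures):

```python
def z_adjust(x, m_src, s_src, m_tgt, s_tgt):
    # Era adjustment via z-scores: re-express x on the target year's scale
    return (x - m_src) / s_src * s_tgt + m_tgt

def ratio_adjust(x, m_src, m_tgt):
    # The simple mean-ratio formula from point #3 above
    return x * m_tgt / m_src

# Similar relative spread in both years: the two methods roughly agree
print(round(z_adjust(85, 70, 10, 90, 13), 1), round(ratio_adjust(85, 70, 90), 1))  # 109.5 109.3
# Very different spreads (like pre- vs post-1978): they diverge
print(round(z_adjust(85, 70, 16, 90, 10), 1), round(ratio_adjust(85, 70, 90), 1))  # 99.4 109.3
```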
-
I'd like to say something else about correlation here, and apply it to something seen frequently in the forum.
The following is a graph of the correlation between passer rating differential and win percentage in the NFL for the 480 team seasons from 2004 to 2018:
The magnitude of that correlation is a whopping 0.81 (95% confidence interval is 0.773 to 0.836).
One of the values of a correlation is that it tells us what to expect. With regard to the above correlation for example, if the Dolphins were to achieve a relatively large passer rating differential in any one season, we could expect them to also have a high win percentage. You'd certainly be foolish to bet money on the Dolphins' having a large passer rating differential and at the same time a low win percentage. You'd certainly lose that bet the vast majority of the time.
Now, the correlation above isn't perfect, however. It isn't 1.0. So, there is some "fudge factor" with regard to predicting win percentage on the basis of passer rating differential.
What we sometimes see in the forum is someone's taking the exception to the rule in the case of a strong correlation and stating that it indicates there is no relationship between the two variables involved. In the case of the above information that might sound something like, "well passer rating differential doesn't mean **** with regard to winning, because in 2010 the Raiders had a small passer rating differential and went 11-5 anyway [hypothetically]."
Very few correlations are 1.0, meaning that such exceptions to the rule can always be found. But what we're talking about here is what's highly probable, and you should certainly believe that the Dolphins would be highly probable to have a high win percentage if they had a large passer rating differential.
So, when you take just a single example and say it indicates something about the relationship between two variables, you must also know the larger correlation at hand between those two variables, based on a much larger sample size.
This is the value of statistics. -
-
The only thing I'll add to your post is that most stats in the NFL have a linear relationship, like you see in the graph above: the trend isn't some curve but a line (plotted in dark red). That trend line is what you use for prediction purposes. The correlation only tells you the strength of the relationship, NOT the relationship itself; the nature of the relationship is specified by the best-fitting line or best-fitting curve.
The equation for that best-fitting line between win% and passer rating differential (across NFL history in the SB era) is:
Win% = 0.95*[passer rating differential] + 50
So, for each one-point increase in passer rating differential you increase win% by about 0.95 percentage points. In a 16-game season that means you need to increase passer rating differential by on average about 6.58 points to add an extra win.
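As a sketch, plugging numbers into that best-fitting line:

```python
def predicted_win_pct(rating_diff):
    # Best-fitting line from above: Win% = 0.95 * (passer rating differential) + 50
    return 0.95 * rating_diff + 50

print(predicted_win_pct(0))               # → 50.0: average differential, an 8-8 pace
print(round(predicted_win_pct(6.58), 2))  # → 56.25: about one extra win over 16 games
```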
Oh, and for future reference, the process of finding the best-fitting line to data is called "linear regression".
EDIT: I should be more clear about something. Correlations tell you the strength of the relationship between the variables themselves. They don't tell you how well you could predict Y from X in the most general sense, because for prediction purposes you could use some fancy non-linear function of X to predict Y. In other words, correlations tell you the strength of a possible linear relationship between X and Y. Last edited: Feb 5, 2020 -
So as cbrad suggested I'm going to move the discussion about QBR to this thread.
This morning I took three hours and 30 minutes and combed through individual ESPN.com box scores for QBR data. I don't wish that on anyone.
Anyway, the question for me was, what is the correlation between QBR differential and points differential in single games in the NFL, and how does that compare to the correlation between traditional passer rating differential and points differential?
So, to gather the data I chose 100 games at random from the 2019 regular season. Randomness was achieved by ordering games on the basis of a variable that's unrelated to either of the ones in question: yards per punt.
So I took the first 100 2019 games generated by Pro Football Reference, again ordered by yards per punt. Here are the results in terms of correlations and 95% confidence intervals (all of the correlations below are significant at the level p < 0.001):
Traditional passer rating differential & points differential: 0.67; [0.547 to 0.766]
QBR differential & points differential: 0.58; [0.406 to 0.680]
Traditional passer rating (offense) & points scored: 0.59; [0.445 to 0.704]
QBR (offense) & points scored: 0.56; [0.406 to 0.680]
Traditional passer rating (defense) & points allowed: 0.71; [0.599 to 0.797]
QBR (defense) & points allowed: 0.63; [0.489 to 0.732]
QBR differential & traditional passer rating differential: 0.60; [0.463 to 0.716]
QBR (offense) & traditional passer rating (offense): 0.66; [0.526 to 0.754]
QBR (defense) & traditional passer rating (defense): 0.73; [0.627 to 0.813]
So as we know, QBR is not transparent. We don't know how the measure is calculated. However, if the measure is calculated in a reliable manner (note the emphasis on the word "if"), then the results above support its intended purpose of teasing the play of quarterbacks apart from that of the rest of their teams, in that all of the QBR correlations are weaker than the traditional passer rating correlations.
What this means of course is that -- again if the measure is calculated reliably -- it's quite possible we have a more valid measure than traditional passer rating of the individual performance of the quarterback, independent of his team.
Of course nothing can tease the performance of the quarterback apart from his team completely, but it is of course possible to design statistical measures that move us further that direction. QBR may in fact be one of those. -
However, your logic in this quote is wrong. Are you suggesting that if we create another black box method where the correlation to win% is lower than for QBR that it's an even better measure of QB ability? Maybe if we have a black box method where the correlation is the lowest possible correlation of zero you'd say it's a perfect measure of QB ability? Makes no sense obviously.
Mathematically, this is what is going on. The more parameters you add in a model (the more measurable factors you add) the more noise you'll have from those measurements, and the more noise you have the worse your predictive power will be UNLESS you can overcome that noise with accurate enough mathematical relations in your model.
In practice, for most physical models you reach a limit in your ability to improve predictive power with a single digit number of parameters in the model, or in some cases low double digits. Based on the description of ESPN's QBR and knowing they have 10,000 lines of code that most likely encode tons of conditional statements (e.g., how much credit to give to the QB in one particular situation, etc...) their model is probably swamped by the massive amount of extra noise.
So if I had to guess, the reason for the lower correlations is due to bad modeling, which argues exactly against your conclusion.
Of course, there's no way to tell whether the lower correlation is due to better modeling but less influence from the rest of the team or due to bad modeling because they don't publish their formula, but the most likely explanation is bad modeling because extracting QB ability from the rest of the team is such a difficult problem that the default assumption should be bad modeling.
Also, the only way to really determine whether the correlation to win% is too high or too low is if you had a perfect measure of QB ability and found that measure's correlation to win%. We don't have that of course, so no way to use the correlations to win% to determine to what degree it's estimating individual ability vs. team ability. However, the lower correlations DO mean there's no reason to use QBR for prediction purposes, and that takes away arguably the only potential utility of a black box method.
Summary: ESPN's QBR is best put in the intellectual trash can, not only because it's black box (which means unknown validity) but because it doesn't predict win% as well (meaning it's not even good for prediction purposes). -
Now if that isn't a laundry list of people's typical complaints about the validity of passer rating on a single-game basis, I don't know what is. How many times have you heard some combination of the following? "But he was pressured all day...his offensive line sucked...his receivers ran for tons of yards after the catch...he was playing against a ****ty defense...he got all his stats during garbage time..." And on and on, ad nauseam.
Now of course it's quite possible the model is overfitted, and unfortunately we can't know that because it's proprietary. But aside from that I wouldn't base the validity of QBR entirely on its predictive ability with regard to winning and toss it completely on that basis. Instead I would explore its validity in relation to 1) its correlation with people's and/or experts' perceptions of quarterbacks' individual ability, and 2) its ability to predict quarterbacks' accomplishments independent of their teams (e.g., being a league All-Pro on a relatively bad team). We can do both of those things without knowing how to calculate it.
https://en.wikipedia.org/wiki/Total_quarterback_rating
Last edited: Feb 5, 2020 -
Also.. you can't determine validity through correlations to win% when you can't see HOW the parameters are related in the model. It's only when you can see those relations and your question is whether those relations actually predict something you care about that correlations to win% (or point differential etc..) become useful.
And everything after EPA needs transparency. Too many possible ways to incorporate those parameters, almost all of which will be wrong. -
From 2017 to 2019 (N = 100 individual QB seasons) the correlation between QBR and total clutch-weighted EPA is 0.898 (95% CI = 0.849 to 0.931).
So QBR essentially is total clutch-weighted EPA, in that nearly 81% of its variance is associated with it.
Other interesting findings:
Over the same time span the correlation between QBR and clutch-weighted EPA on plays with pass attempts is 0.85 (95% CI = 0.781 to 0.898).
QBR and clutch-weighted EPA through rushes: 0.31 [0.118 to 0.486]
QBR and clutch-weighted EPA (lost) on sacks: 0.15 [-0.059 to 0.340]
QBR and clutch-weighted EPA on penalties: 0.38 [0.190 to 0.541]
QBR and traditional passer rating: 0.78 [0.679 to 0.846]
Traditional passer rating and total clutch-weighted EPA: 0.696 [0.573 to 0.788]
So it appears that while QBR essentially is total clutch-weighted EPA, traditional passer rating is not just total clutch-weighted EPA. Almost 52% of the variance in traditional passer rating is unexplained by total clutch-weighted EPA, while only 19% of QBR is unexplained by it. -
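The variance figures above come straight from squaring the correlations. A quick sketch of the arithmetic:

```python
def pct_unexplained(r):
    # Share of variance in one stat NOT accounted for by the other: 1 - r^2
    return (1 - r * r) * 100

print(round(pct_unexplained(0.898), 1))  # QBR vs clutch-weighted EPA: ≈ 19.4% unexplained
print(round(pct_unexplained(0.696), 1))  # passer rating vs clutch-weighted EPA: ≈ 51.6%
```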
Also, we know at least the premise of the clutch-weighting based on the following:
https://www.espn.com/blog/statsinfo...-calculated-we-explain-our-quarterback-rating -
-
I mean they've obviously determined a non-transparent win probability associated with down-weighting EPA during garbage time situations, but I suspect 1) those situations aren't all that prevalent in the league, given the parity among teams, and 2) the win probability they're using was likely determined on objective grounds and not arbitrarily.
In the end you have likely a minuscule amount of variation in QBR determined by this particular "black box" component, and even that amount of variation is likely objectively determined and not arbitrary. -
So if it's black box you can immediately reject it for that reason and that reason only. -
-
We know the thing is essentially EPA with a down-weighting for garbage time. The question is, do those two variables when applied to quarterbacks achieve construct validity, convergent validity, discriminant validity, and criterion validity? -
And once you say something is a measure of QB ability you cannot determine construct validity with a black box method. How QBR relates to anything else is completely beside the point if the goal is to suggest it may be a valid measure of QB ability (what you suggested). -
There are statistical techniques that allow you to estimate how much you can "average out" confounds with large sample size and also statistical techniques that allow you to estimate how much of an effect whatever is left out might have. So there's room for analysis of construct validity, but there's no such room with a black box method. -
I do think however that QBR is more appealing conceptually than traditional passer rating, and when we realize that 81% of its variance is accounted for by nothing more than EPA down-weighted by garbage time performance, I suspect it's more valid for measuring quarterbacks' individual ability as well. -
Some other correlations of note from the sample of season QBR ratings from 2017 to 2019:
Adjusted (to 2019) traditional passer rating and win percentage: 0.648 [0.512 to 0.752]
Adjusted traditional passer rating and QBR: 0.781 [0.687 to 0.850]
QBR and win percentage: 0.634 [0.494 to 0.742]
Clutch-weighted EPA and win percentage: 0.603 [0.455 to 0.718] -
Some other correlations and 95% confidence intervals of note for the sample of season QBs from 2017 to 2019:
DVOA and win percentage: 0.63 [0.48 to 0.74]
DVOA and adjusted (to 2019) passer rating: 0.90 [0.85 to 0.93]
DVOA and QBR: 0.85 [0.79 to 0.90]
DVOA and clutch-weighted EPA: 0.76 [0.66 to 0.83]
DVOA and clutch-weighted EPA on plays involving pass attempts: 0.83 [0.76 to 0.89]
(Here is a description of DVOA: https://www.footballoutsiders.com/info/methods)
So to sum this up, the correlations between quarterbacks' win percentage and 1) traditional passer rating, 2) QBR, and 3) DVOA are essentially interchangeable. Total clutch-weighted EPA, which is one of the variables involved in QBR, appears to be substantially more strongly related to QBR (0.90) than it is to either traditional passer rating (0.69) or DVOA (0.76). This of course is intuitive, but it's interesting nonetheless that these metrics appear to be measuring different things about quarterback play. -
Expounding on the value of statistics a bit, I met a 71-year-old man yesterday who believes his prostate cancer is terminal and that he has five years to live, because his prostate-specific antigen (PSA) level had increased from 0.08 to 0.16 in the past year, following his prostatectomy. This man was already depressed about losing his wife and his sister in the past year and was hoping to find, in his words, "at least a companion to spend the rest of my life with, but now that I don't have long to live, I won't be subjecting anybody to that kind of hurt."
The information on the following webpage -- all statistically derived -- would go a long way toward reassuring this man about his longevity, at least in relation to prostate cancer, and would hopefully encourage him to go out and find the companion he wants:
https://www.hopkinsmedicine.org/bra...t-should-i-do-if-my-psa-returns-after-surgery -
Here's an interesting article that applies statistics to football:
-
The interpretations are VERY tricky though. They go through some issues in that article, but in general, while I think the stats are OK, you really have to think carefully about what caused what with OL metrics like that.