A good suggestion was made by Irishman to create a dedicated thread for the discussion of statistical methodology in football. The idea is to move as much of the discussion about statistical methodology as possible into one thread so that it doesn't clutter up other threads. Results of statistical analyses can still be presented in other threads, but discussion about the methodology itself is probably better off in its own thread.
I'll start off this thread by giving a simple intro to four commonly discussed methods:
1) z-scores
2) correlations
3) confidence intervals
4) hypothesis tests
z-scores
The purpose of z-scores is to take different sets of measurements and put them on the same scale. A common application of z-scores in football is to compare stats across eras.
I'm not sure how best to explain z-scores (for me it would be best to just show the math.. one line and we're done lol), but one way to think of z-scores is to ask how you would convert between two different scales like Celsius and Fahrenheit. First you need to know where the origin of the scale lies (where zero is): zero degrees Celsius corresponds to 32 degrees Fahrenheit. Then you need to know how the units of measurement compare: every 5-degree increase in Celsius corresponds to a 9-degree increase in Fahrenheit.
Same thing with z-scores. First you need to know what the origin corresponds to: a z-score of zero always corresponds to league average. Then you need to know the “unit of measurement”, which in this case is the standard deviation (a measure of how far the values go from league average). For those interested in how standard deviation is calculated:
https://en.wikipedia.org/wiki/Standard_deviation
Here's an example of how to convert a stat like passer rating to z-scores. In 1970 the league average for team passer ratings was 66, so a 66 rating in 1970 automatically corresponds to a z-score of 0. The standard deviation was 14.34, so one z-score unit in 1970 equals 14.34 passer rating points. If a QB had a rating of 85 in 1970, that's (85 – 66)/14.34 = 1.325 z-scores above the mean. If the rating is below the mean you just get a negative z-score.
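That arithmetic really is a one-liner. Here it is as a minimal Python sketch using the 1970 figures above:

```python
def z_score(value, league_mean, league_sd):
    # Distance from league average, measured in standard deviation units
    return (value - league_mean) / league_sd

# 1970 team passer rating: league mean 66, standard deviation 14.34
z = z_score(85, 66, 14.34)
print(round(z, 3))  # → 1.325
```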
How to interpret z-scores? They tell you how “impressive” something was relative to league average, regardless of era or even type of measurement (you actually could compare z-scores of passing stats to z-scores of rushing stats). However, z-scores do not tell you whether an offense with a z-score of 1.325 in 1970 would perform with the same z-score in 2019. There’s no implication about transplanting someone from the past to the present or vice versa, just a measure of how “impressive” something is regardless of era.
One final thing: you can use z-scores to report an "adjusted" rating in some target year, like a 1970 rating in 2019 numbers. Just calculate the z-score, then translate it to the 2019 scale, no different than going from Celsius to Fahrenheit.
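As a sketch of that translation in Python, using the 2019 league figures quoted later in this thread (mean 90.77, SD 10.3):

```python
def to_target_scale(value, src_mean, src_sd, tgt_mean, tgt_sd):
    # Convert to a z-score on the source scale, then map onto the target scale,
    # exactly like converting Celsius to Fahrenheit
    z = (value - src_mean) / src_sd
    return z * tgt_sd + tgt_mean

# An 85 passer rating in 1970 (mean 66, SD 14.34) expressed in 2019 numbers
print(round(to_target_scale(85, 66, 14.34, 90.77, 10.3), 1))  # → 104.4
```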
Correlation
A correlation is a measure of how two sets of stats are related to each other. Without getting too technical, I'd interpret it as a measure of how well you can predict the value of one stat from the other. A correlation of zero means you have no ability to predict beyond random guessing. The maximum possible correlation of 1, or the minimum possible correlation of -1, means that by knowing one stat you know exactly what happens to the other. The only difference between 1 and -1: a correlation of 1 means an increase in one stat implies an increase in the other, while a correlation of -1 means an increase in one stat implies a decrease in the other.
Thus, correlations range from -1 to 1 and the closer the number is to either -1 or 1 the better you can predict one stat from the other. Two things to remember with correlations: 1) there is no implication of causality, and 2) there is no implied order in the two stats being compared, meaning that the predictive relationship is the same regardless of which stat you use to try to make the prediction.
One more thing about correlations: the degree to which you can predict one stat from the other is actually not the correlation itself but the square of the correlation, which is called r-squared or r^2. So if someone reports a correlation of 0.5 it’s really the square of that number that is meaningful: 0.5^2 = 0.25 because that tells you the percent of variation in one stat you can explain by looking at the other stat (in this case 25% of the variation in one stat is known by looking at the other stat).
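Here's a minimal standard-library Python sketch of a Pearson correlation and its r^2, on made-up data (the numbers are purely illustrative):

```python
import statistics

def pearson_r(xs, ys):
    # Covariance of the two stats divided by the product of their spreads
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical paired stats
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r, 3), round(r ** 2, 2))  # → 0.775 0.6, i.e. 60% of variance explained
```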
Confidence intervals
With almost everything in statistics there is something called a confidence interval, or CI, associated with it, and usually it's a 95% CI. Loosely speaking, a 95% CI specifies the range within which the true value of that statistic lies with 95% probability (more precisely, the procedure that produces the interval captures the true value 95% of the time). The important thing about a CI is that it depends on sample size, and it's how you see the effect of sample size on a stat.
To be clear, almost never do you see 95% CI reported in commonly available football stats. That’s because the standards are low. They really should report 95% CI with every stat so you can see how uncertain the estimates are.
To give you an idea of what 95% CI looks like for Tom Brady (only QB I’ve calculated it for), after 1 game played in a season the 95% CI for Brady’s passer rating spans almost 140 passer rating points (70 above and 70 below whatever rating he got in that first game!), after 2 games it goes down to about 40 above and 40 below his passer rating after 2 games, and that 95% CI keeps decreasing as more games are played. In other words, as sample size increases the range within which the “true” passer rating lies keeps shrinking and you can have more confidence that the stat reflects "true" ability.
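The shrinking is no accident: CI width scales roughly with one over the square root of sample size. Here's a toy sketch of that shape in Python; the ±70 starting point is the figure above, but real game data won't follow this idealized curve exactly:

```python
import math

def ci_half_width(n_games, base=70.0):
    # Rough sketch: a 95% CI half-width shrinks like 1/sqrt(sample size)
    return base / math.sqrt(n_games)

for n in (1, 2, 4, 16):
    print(n, round(ci_half_width(n), 1))  # 70.0, 49.5, 35.0, 17.5
```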
CI is how sample size affects every single statistic, and I put it here not only because it’s important, but also because it makes explaining the next topic easy.
Hypothesis testing
You often hear that something is "statistically significant". All that means is that it is too unlikely to have occurred by chance alone. Once you know the 95% CI (see previous section), all you have to do is ask whether the statistic you observed lies within the 95% CI or not. If yes, it is still plausibly the result of random variation alone. If no, it's "statistically significant" and considered too unlikely to have occurred by chance alone.
The choice of using a 95% CI rather than a 99% CI is arbitrary, but it’s the standard in almost every area of science. 95% CI corresponds to a 1 in 20 chance of the event occurring by random variation alone and that's generally unlikely enough in most contexts to say it's "statistically significant". I’ll note however that there are other contexts where the threshold is way higher. Best example is particle physics where the threshold might be at 1 in 3.5 million (5 standard deviations) before it’s statistically significant lol.
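Those thresholds are easy to check against the normal distribution. A quick sketch (`math.erfc` gives the two-sided tail probability):

```python
import math

def two_sided_p(z):
    # Chance of landing at least z standard deviations from the mean (either side)
    return math.erfc(z / math.sqrt(2))

print(round(two_sided_p(1.96), 3))      # ≈ 0.05, i.e. the 1-in-20 threshold
print(round(1 / (two_sided_p(5) / 2)))  # one-sided 5-sigma: ≈ 1 in 3.5 million
```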
OK.. maybe that will get things started.
Page 1 of 3
-
Great stuff cbrad, thank you.
I'll just add that with regard to the topic of z-scores, a z-score is also a measure of how rare (or common) something is.
Take the following distribution for example -- this is what's known as the normal distribution:
In this particular example of a normal distribution, the x-axis (horizontal, along the bottom) consists of IQ scores. The y-axis (vertical, along the left side) consists of the frequency of those IQ scores in the population. So the further you go either right or left along the x-axis, the rarer the IQ score; the closer to the middle of the x-axis, the more common it is. IQ scores of 145 or more (or 55 or less), which correspond to z-scores of 3.0 and above (or -3.0 and below), are exceedingly rare, whereas IQ scores between, say, 95 and 105 are far more common.
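Those rarity figures can be checked directly against the normal distribution (assuming IQ has mean 100 and SD 15, as in the graph):

```python
import math

def fraction_within(z):
    # Fraction of a normal population within ±z standard deviations of the mean
    return math.erf(z / math.sqrt(2))

print(round(1 - fraction_within(3), 4))  # beyond ±3 SD (IQ under 55 or over 145): ≈ 0.0027
print(round(fraction_within(1 / 3), 2))  # within ±1/3 SD (IQ 95 to 105): ≈ 0.26
```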
How does this translate to football?
Passer ratings for example are likely distributed normally as in the graph above, where very low and very high passer ratings are rare, and average passer ratings are much more common. If we use career passer ratings to represent quarterbacks' quality, then there have been far more average quarterbacks in the league than there have been extremely good or extremely bad ones.
So with a z-score what we have is a measure of how much something deviates from the norm. An IQ of 145 deviates considerably from the norm, as does a season passer rating of 117.5, a la Ryan Tannehill in 2019. The average passer rating in the league in 2019 was 90.77 with a standard deviation of 10.3, and Tannehill's passer rating of 117.5 was 2.59 z-scores (or standard deviations) above league average. That means it deviated considerably from the league norm.
The next time you go to your doctor and he orders labs for you, if you get a printout of your labs you'll see that there is a normal range for every value. Your white blood cell count for example has a normal range of 4,500 to 11,000 white blood cells per microliter of blood. If you find yourself above or below that normal range, the next question for you should be "how far," and we could answer that with a z-score, which tells us how much your white blood cell count deviates from the norm. -
-
I hope as the off-season doldrums roll on, some of our posters will be able to review this information without bringing in their perceptions (biases) and getting "wound around the axle" with word definitions.
When this information is examined prior to trying to use it, I suspect it will reduce the number of posters feeling they are being attacked for an opinion, when all that was presented was a piece of information about a statistic and its use.
It will be interesting to see what questions come up as a result of this primer on statistics. -
Some points regarding passer rating.
1) It was developed before the NFL started tracking sacks as a statistic. The NFL wants to compare QBs across all of its history, so it will not incorporate sacks into the official method.
2) Passer rating would be improved if sacks were included as negative pass plays. However, because sack rates are roughly normally distributed, in the majority of large-sample situations including sacks would not dramatically alter the outcome.
3) As time has progressed the NFL has gotten better at passing the football. Average passer rating has been increasing. To adjust historical records to allow comparison with current ratings use the following formula:
[target year rating] x [current year average/target year average]
4) The NFL made major changes to the passing rules starting in the 1978 season. Data from 1978 onwards has a lower standard deviation and is more consistent year to year. If you are comparing current data to historical data it is best to use 1978 onwards as your historical data set. -
Adjusting ratings using z-scores is in almost every case the correct approach (I'll list exceptions at the end of this post). IF the standard deviations across years are similar, then adjusting by z-scores is accurately approximated by the formula in #3. But you cannot use that simple method of adjustment when the standard deviations change, as they did drastically from pre-1978 to post-1978 for passer rating. So if you want to compare passer ratings from, say, 1970 to 2010 you HAVE to use z-scores. If you only care about passer ratings post-1978 then the formula in #3 is fine.
Also, that 1978 boundary is ONLY for passer rating (and also a few other passing stats). It's not a boundary for all stats.
So when should you not use z-scores? When the distribution of the stat is highly skewed. The problem with z-scores is that they assume symmetry in the distribution of the stat, which is usually correct for football stats with large sample sizes. But when you take ratios like TD:INT ratio that symmetry disappears, and it's better to adjust stats differently. How to do that is on a case-by-case basis, but in some cases the formula in #3 works if you replace "average" with "median", which is insensitive to skew. Last edited: Feb 4, 2020 -
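To see the two adjustment methods side by side, here's a sketch with hypothetical league means and standard deviations (not real-year figures):

```python
def z_adjust(x, m_src, s_src, m_tgt, s_tgt):
    # Era adjustment via z-scores: re-express x on the target year's scale
    return (x - m_src) / s_src * s_tgt + m_tgt

def ratio_adjust(x, m_src, m_tgt):
    # The simple mean-ratio formula from point #3 above
    return x * m_tgt / m_src

# Similar relative spread in both years: the two methods roughly agree
print(round(z_adjust(85, 70, 10, 90, 13), 1), round(ratio_adjust(85, 70, 90), 1))  # 109.5 109.3
# Very different spreads (like pre- vs post-1978): they diverge
print(round(z_adjust(85, 70, 16, 90, 10), 1), round(ratio_adjust(85, 70, 90), 1))  # 99.4 109.3
```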
-
I'd like to say something else about correlation here, and apply it to something seen frequently in the forum.
The following is a graph of the correlation between passer rating differential and win percentage in the NFL for the 480 team seasons from 2004 to 2018:
The magnitude of that correlation is a whopping 0.81 (95% confidence interval is 0.773 to 0.836).
One of the values of a correlation is that it tells us what to expect. With regard to the above correlation for example, if the Dolphins were to achieve a relatively large passer rating differential in any one season, we could expect them to also have a high win percentage. You'd certainly be foolish to bet money on the Dolphins' having a large passer rating differential and at the same time a low win percentage. You'd certainly lose that bet the vast majority of the time.
Now, the correlation above isn't perfect, however. It isn't 1.0. So, there is some "fudge factor" with regard to predicting win percentage on the basis of passer rating differential.
What we sometimes see in the forum is someone's taking the exception to the rule in the case of a strong correlation and stating that it indicates there is no relationship between the two variables involved. In the case of the above information that might sound something like, "well passer rating differential doesn't mean **** with regard to winning, because in 2010 the Raiders had a small passer rating differential and went 11-5 anyway [hypothetically]."
Very few correlations are 1.0, meaning that such exceptions to the rule can always be found. But what we're talking about here is what's highly probable, and you should certainly believe that the Dolphins would be highly probable to have a high win percentage if they had a large passer rating differential.
So, when you take just a single example and say it indicates something about the relationship between two variables, you must also know the larger correlation at hand between those two variables, based on a much larger sample size.
This is the value of statistics. -
-
The only thing I'll add to your post is that most stats in the NFL have a linear relationship, like you see in the graph above: the trend isn't some curve but a line (plotted in dark red). That trend line is what you use for prediction purposes. The correlation only tells you the strength of the relationship, NOT the relationship itself; the nature of the relationship is specified by the best-fitting line or best-fitting curve.
The equation for that best-fitting line between win% and passer rating differential (across NFL history in the SB era) is:
Win% = 0.95*[passer rating differential] + 50
So, for each one-point increase in passer rating differential you increase win% by about 0.95 percentage points. In a 16-game season that means you need to increase passer rating differential by on average about 6.58 points to add an extra win.
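As a sketch, plugging numbers into that best-fitting line:

```python
def predicted_win_pct(rating_diff):
    # Best-fitting line from above: Win% = 0.95 * (passer rating differential) + 50
    return 0.95 * rating_diff + 50

print(predicted_win_pct(0))               # → 50.0: average differential, an 8-8 pace
print(round(predicted_win_pct(6.58), 2))  # → 56.25: about one extra win over 16 games
```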
Oh, and for future reference, the process of finding the best-fitting line to data is called "linear regression".
EDIT: I should be more clear about something. Correlations tell you the strength of the relationship between the variables themselves. They don't tell you how well you could predict Y from X in the most general sense, because for prediction purposes you could use some fancy non-linear function of X to predict Y. In other words, correlations tell you the strength of a possible linear relationship between X and Y. Last edited: Feb 5, 2020 -
So as cbrad suggested I'm going to move the discussion about QBR to this thread.
This morning I took three hours and 30 minutes and combed through individual ESPN.com box scores for QBR data. I don't wish that on anyone.
Anyway, the question for me was, what is the correlation between QBR differential and points differential in single games in the NFL, and how does that compare to the correlation between traditional passer rating differential and points differential?
So, to gather the data I chose 100 games at random from the 2019 regular season. Randomness was achieved by ordering games on the basis of a variable that's unrelated to either of the ones in question: yards per punt.
So I took the first 100 2019 games generated by Pro Football Reference, again ordered by yards per punt. Here are the results in terms of correlations and 95% confidence intervals (all of the correlations below are significant at the level p < 0.001):
Traditional passer rating differential & points differential: 0.67; [0.547 to 0.766]
QBR differential & points differential: 0.58; [0.406 to 0.680]
Traditional passer rating (offense) & points scored: 0.59; [0.445 to 0.704]
QBR (offense) & points scored: 0.56; [0.406 to 0.680]
Traditional passer rating (defense) & points allowed: 0.71; [0.599 to 0.797]
QBR (defense) & points allowed: 0.63; [0.489 to 0.732]
QBR differential & traditional passer rating differential: 0.60; [0.463 to 0.716]
QBR (offense) & traditional passer rating (offense): 0.66; [0.526 to 0.754]
QBR (defense) & traditional passer rating (defense): 0.73; [0.627 to 0.813]
So as we know, QBR is not transparent. We don't know how the measure is calculated. However, if the measure is calculated in a reliable manner (note the emphasis on the word "if"), then the results above support its intended purpose of teasing the play of quarterbacks apart from that of the rest of their teams, in that all of the QBR correlations are weaker than the traditional passer rating correlations.
What this means of course is that -- again if the measure is calculated reliably -- it's quite possible we have a more valid measure than traditional passer rating of the individual performance of the quarterback, independent of his team.
Of course nothing can tease the performance of the quarterback apart from his team completely, but it is of course possible to design statistical measures that move us further that direction. QBR may in fact be one of those. -
However, your logic in this quote is wrong. Are you suggesting that if we create another black box method where the correlation to win% is lower than for QBR that it's an even better measure of QB ability? Maybe if we have a black box method where the correlation is the lowest possible correlation of zero you'd say it's a perfect measure of QB ability? Makes no sense obviously.
Mathematically, this is what is going on. The more parameters you add in a model (the more measurable factors you add) the more noise you'll have from those measurements, and the more noise you have the worse your predictive power will be UNLESS you can overcome that noise with accurate enough mathematical relations in your model.
In practice, for most physical models you reach a limit in your ability to improve predictive power with a single digit number of parameters in the model, or in some cases low double digits. Based on the description of ESPN's QBR and knowing they have 10,000 lines of code that most likely encode tons of conditional statements (e.g., how much credit to give to the QB in one particular situation, etc...) their model is probably swamped by the massive amount of extra noise.
So if I had to guess, the reason for the lower correlations is due to bad modeling, which argues exactly against your conclusion.
Of course, there's no way to tell whether the lower correlation is due to better modeling but less influence from the rest of the team or due to bad modeling because they don't publish their formula, but the most likely explanation is bad modeling because extracting QB ability from the rest of the team is such a difficult problem that the default assumption should be bad modeling.
Also, the only way to really determine whether the correlation to win% is too high or too low is if you had a perfect measure of QB ability and found that measure's correlation to win%. We don't have that of course, so no way to use the correlations to win% to determine to what degree it's estimating individual ability vs. team ability. However, the lower correlations DO mean there's no reason to use QBR for prediction purposes, and that takes away arguably the only potential utility of a black box method.
Summary: ESPN's QBR is best put in the intellectual trash can, not only because it's black box (which means unknown validity) but because it doesn't predict win% as well (meaning it's not even good for prediction purposes). -
Now if that isn't a laundry list of people's typical complaints about the validity of passer rating on a single-game basis, I don't know what is. How many times have you heard some combination of the following? "But he was pressured all day...his offensive line sucked...his receivers ran for tons of yards after the catch...he was playing against a ****ty defense...he got all his stats during garbage time..." And on and on, ad nauseam.
Now of course it's quite possible the model is overfitted, and unfortunately we can't know that because it's proprietary. But aside from that I wouldn't base the validity of QBR entirely on its predictive ability with regard to winning and toss it completely on that basis. Instead I would explore its validity in relation to 1) its correlation with people's and/or experts' perceptions of quarterbacks' individual ability, and 2) its ability to predict quarterbacks' accomplishments independent of their teams (e.g., being a league All-Pro on a relatively bad team). We can do both of those things without knowing how to calculate it.
https://en.wikipedia.org/wiki/Total_quarterback_rating
Last edited: Feb 5, 2020 -
Also.. you can't determine validity through correlations to win% when you can't see HOW the parameters are related in the model. It's only when you can see those relations and your question is whether those relations actually predict something you care about that correlations to win% (or point differential etc..) become useful.
And everything after EPA needs transparency. Too many possible ways to incorporate those parameters, almost all of which will be wrong. -
From 2017 to 2019 (N = 100 individual QB seasons) the correlation between QBR and total clutch-weighted EPA is 0.898 (95% CI = 0.849 to 0.931).
So QBR essentially is total clutch-weighted EPA, in that nearly 81% of its variance is associated with it.
Other interesting findings:
Over the same time span the correlation between QBR and clutch-weighted EPA on plays with pass attempts is 0.85 (95% CI = 0.781 to 0.898).
QBR and clutch-weighted EPA through rushes: 0.31 [0.118 to 0.486]
QBR and clutch-weighted EPA (lost) on sacks: 0.15 [-0.059 to 0.340]
QBR and clutch-weighted EPA on penalties: 0.38 [0.190 to 0.541]
QBR and traditional passer rating: 0.78 [0.679 to 0.846]
Traditional passer rating and total clutch-weighted EPA: 0.696 [0.573 to 0.788]
So it appears that while QBR essentially is total clutch-weighted EPA, traditional passer rating is not just total clutch-weighted EPA. Almost 52% of the variance in traditional passer rating is unexplained by total clutch-weighted EPA, while only 19% of QBR is unexplained by it. -
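The variance figures above come straight from squaring the correlations. A quick sketch of the arithmetic:

```python
def pct_unexplained(r):
    # Share of variance in one stat NOT accounted for by the other: 1 - r^2
    return (1 - r * r) * 100

print(round(pct_unexplained(0.898), 1))  # QBR vs clutch-weighted EPA: ≈ 19.4% unexplained
print(round(pct_unexplained(0.696), 1))  # passer rating vs clutch-weighted EPA: ≈ 51.6%
```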
Also, we know at least the premise of the clutch-weighting based on the following:
https://www.espn.com/blog/statsinfo...-calculated-we-explain-our-quarterback-rating -
-
I mean they've obviously determined a non-transparent win probability associated with down-weighting EPA during garbage time situations, but I suspect 1) those situations aren't all that prevalent in the league, given the parity among teams, and 2) the win probability they're using was likely determined on objective grounds and not arbitrarily.
In the end you have likely a minuscule amount of variation in QBR determined by this particular "black box" component, and even that amount of variation is likely objectively determined and not arbitrary. -
So if it's black box you can immediately reject it for that reason and that reason only. -
-
We know the thing is essentially EPA with a down-weighting for garbage time. The question is, do those two variables when applied to quarterbacks achieve construct validity, convergent validity, discriminant validity, and criterion validity? -
And once you say something is a measure of QB ability you cannot determine construct validity with a black box method. How QBR relates to anything else is completely beside the point if the goal is to suggest it may be a valid measure of QB ability (what you suggested). -
There are statistical techniques that allow you to estimate how much you can "average out" confounds with large sample size and also statistical techniques that allow you to estimate how much of an effect whatever is left out might have. So there's room for analysis of construct validity, but there's no such room with a black box method. -
I do think however that QBR is more appealing conceptually than traditional passer rating, and when we realize that 81% of its variance is accounted for by nothing more than EPA down-weighted by garbage time performance, I suspect it's more valid for measuring quarterbacks' individual ability as well. -
Some other correlations of note from the sample of season QBR ratings from 2017 to 2019:
Adjusted (to 2019) traditional passer rating and win percentage: 0.648 [0.512 to 0.752]
Adjusted traditional passer rating and QBR: 0.781 [0.687 to 0.850]
QBR and win percentage: 0.634 [0.494 to 0.742]
Clutch-weighted EPA and win percentage: 0.603 [0.455 to 0.718] -
Some other correlations and 95% confidence intervals of note for the sample of season QBs from 2017 to 2019:
DVOA and win percentage: 0.63 [0.48 to 0.74]
DVOA and adjusted (to 2019) passer rating: 0.90 [0.85 to 0.93]
DVOA and QBR: 0.85 [0.79 to 0.90]
DVOA and clutch-weighted EPA: 0.76 [0.66 to 0.83]
DVOA and clutch-weighted EPA on plays involving pass attempts: 0.83 [0.76 to 0.89]
(Here is a description of DVOA: https://www.footballoutsiders.com/info/methods)
So to sum this up, the correlations between quarterbacks' win percentage and 1) traditional passer rating, 2) QBR, and 3) DVOA are essentially interchangeable. Total clutch-weighted EPA, which is one of the variables involved in QBR, appears to be substantially more strongly related to QBR (0.90) than it is to either traditional passer rating (0.69) or DVOA (0.76). This of course is intuitive, but it's interesting nonetheless that these metrics appear to be measuring different things about quarterback play. -
Expounding on the value of statistics a bit, I met a 71-year-old man yesterday who believes his prostate cancer is terminal and that he has five years to live, because his prostate-specific antigen (PSA) level had increased from 0.08 to 0.16 in the past year, following his prostatectomy. This man was already depressed about losing his wife and his sister in the past year and was hoping to find, in his words, "at least a companion to spend the rest of my life with, but now that I don't have long to live, I won't be subjecting anybody to that kind of hurt."
The information on the following webpage -- all statistically derived -- would go a long way toward reassuring this man about his longevity, at least in relation to prostate cancer, and would hopefully encourage him to go out and find the companion he wants:
https://www.hopkinsmedicine.org/bra...t-should-i-do-if-my-psa-returns-after-surgery -
Here's an interesting article that applies statistics to football:
-
The interpretations are VERY tricky though. They go through some issues in that article, but in general, while I think the stats are OK, you really have to think carefully about what caused what with OL metrics like that.