As anyone who follows some of the principals on Fantasy Football Twitter knows, the importance of size among wide receivers is a divisive topic with key members from across the aisle taking sides and dividing themselves into camps of #TeamBigWR and #TeamSmallWR. The schism has become perhaps the second-most factious in the history of the internet.
Well, okay, in reality the size debate is just a relatively amiable disagreement among people who are largely all on the same page. As I mentioned in part 1 of my look at WR size, pretty much all parties agree that, if all else is equal, it’s better to be big than small. And pretty much all parties agree that size is just one piece of a larger puzzle; no one would advocate drafting 6’4” Rueben Randle over 5’10” Antonio Brown, for instance.
Instead, the argument takes place over a very narrow space of belief. Team Big WR believes that wide receiver height, (or weight, or BMI, or other measure of “bigness”), is underrated by the market. Team Small WR believes it is not. (Some members of Team Small WR might even believe that height/weight/BMI/bigness is overrated by the market, though I have not personally seen anyone advancing that theory.)
Today, I’d like to examine that belief with respect to NFL front offices. But first, to establish the terms, I want to copy a chart from the last column. I took all receivers from the last 10 drafts who were drafted in or around the first 100 picks and sorted them into 10-player buckets based on draft position, (slightly more in the case of ties). Here’s the average height of receivers in each bucket:
|Draft Position|Average Height|
Again, this is to provide context to the discussion that follows. Front offices, like the rest of us, recognize that more height is more good and clearly prioritize it in the NFL draft. The receivers taken in the top 8 were three full inches taller than those taken from pick 97-104, and that’s even with tiny Tavon Austin dragging down the average.
But do front offices prioritize height enough? Or do they perhaps still underrate its impact? Should the drop-off in that chart look even steeper than it already does? This is exactly the type of question that can be answered by statistical inference, and several smart people have already attempted to do just that. I will get into what they’ve found, but before I do so, I think it’s important to take a rather large detour into a discussion of the concept of statistical inference in general.
I’m sure you’re all familiar with the saying that there are three types of lies: lies, damned lies, and statistics. I bet most of you have heard the quip that some people use statistics like a drunk uses a light post: for support, not illumination. It’s true that statistics can be misleading, whether intentionally so or not. That’s why I think basic statistical literacy is so important— if you understand the concepts at play, you’re less likely to go astray when it comes time to interpret them.
Now, I’m not a statistician. In fact, my entire experience with statistics, as a branch of mathematics, is a single high school stats course I took well over a decade ago. Still, while actually performing statistical inference may be beyond my meager skills, I do have a strong understanding of the concepts involved. I like to say that I’m not a native, but I speak the language.
When most people use statistical inference, they are really performing something called a “linear regression”, which basically amounts to looking at a big plot of data points, drawing a “line of best fit” that most closely approximates the data, and then drawing conclusions from there. The regression itself is the easy part: anyone with Excel or an internet connection and 30 minutes to kill can run regressions to their heart’s content. The difficulty lies entirely in the interpretation.
There are plenty of potential pitfalls in this type of statistical inference, but I would like to briefly highlight five of them. First, data may not actually be linear. Second, data may be linear, but not statistically significant. Third, the data may be linear and statistically significant, but it might not hold any explanatory power. Fourth, the data may be linear, statistically significant, and meaningfully explanatory, but not all that practically significant. Finally, the data may be linear, significant in all senses of the word, and meaningfully explanatory, but may be the result of lurking variables.
Let's break it down one by one. Our goal is to find a relationship that is:
1. Truly Linear...
Often, the data we encounter in our daily lives follows a relatively linear pattern. For instance, through some incomprehensible twist of fate I find myself in a position where I am paid for the fantasy football articles I write. The amount I'm paid has a strong linear relationship with the amount of time I spend writing. The relationship isn't a perfect line— some articles require more hours to complete while others require fewer— but by and large my fantasy earnings can be modeled by a formula Y = B * X, where Y is my take-home pay, X is the number of hours I spend writing, and B is some best-fit average amount I make per hour spent.
Sometimes, however, data does not follow a nice linear pattern. Imagine, for instance, that it is 5 PM and you are a student who has a test tomorrow at 8 AM. You want to know the relationship between how many hours you study and what you would expect to score on the test. At low values, you might expect to see a linear relationship forming. Studying two hours will probably be twice as good as studying one hour. Past a certain point, however, the linear relationship breaks down. If you study for 15 hours, you will probably do pretty poorly on the test, since you’ll be taking it after an all-night cram session with no sleep. If you study for 16 hours, you will get a zero on the test, since you missed it entirely.
Indeed, studying for a test might follow a parabolic relationship, where studying increases scores steadily early on, begins to experience diminishing returns, eventually plateaus, and then thanks to the power of sleep deprivation begins actively decreasing your expected score. There are plenty of other types of relationships, too— exponential relationships, logarithmic relationships, etc. Running a linear regression on data that follows an exponential pattern will still return a result, but it’s up to us to decide whether that’s really the best tool in our tool belt for the task at hand.
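Here is a rough Python sketch of that last point. The scores are invented for illustration (a hypothetical curve that peaks at eight hours of studying, not real data), and `fit_line` is just a name I picked for a plain least-squares helper. The regression happily returns an answer; the answer simply isn't a useful one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: 0 through 16 hours of studying, with a made-up
# parabolic score curve that peaks at 8 hours.
hours = list(range(0, 17))
scores = [100 - (h - 8) ** 2 for h in hours]

intercept, slope = fit_line(hours, scores)
print(f"best-fit line: score = {intercept:.1f} + {slope:.1f} * hours")
```

In this toy data the score is completely determined by hours studied, yet the best-fit line comes out flat, with a slope of zero. Taken at face value, the line says studying has no effect at all, which is exactly the wrong conclusion.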
2. ... Statistically Significant...
Significance testing is often confusing for those unfamiliar with statistical inference, because what a statistician means when he says “significant” is not the same as what the average person means when he or she uses the word.
When reading about statistical significance, you will often see people referring to “p”. You might see a phrase like “results are statistically significant, (p = 0.05)”, or “our p-value of 0.003 easily passes most tests of statistical significance”. “P” is a value between 0 and 1, and it’s essentially an attempt to express mathematically the answer to the question “if there was absolutely no relationship between our variables whatsoever, how likely is it that we would observe a relationship at least this strong entirely by chance?” P is better when it is closer to 0, which means that the relationship was less likely to occur entirely by chance.
For instance, let’s say I have a coin and I want to know if it’s weighted towards heads. I flip the coin once and it comes up heads. Could I say with any authority that my coin is weighted? At this point, of course not; even if my coin is NOT weighted, there’s a 50% chance I would have flipped heads anyway just through sheer, blind, dumb luck. In other words, the p-value on my data right now is 0.5, which is positively massive by statistical standards.
Let’s say, then, that I flip the coin nine more times, and it comes up heads each of those nine times. I now have a data set of ten flips, all of them heads. Now, it’s entirely possible that this data set could have arisen through blind chance. There’s nothing about a standard unweighted coin that would prevent it from flipping heads ten times in a row. The odds of that happening with an unweighted coin, however, are 0.00098, or just under 1-in-1000. This p-value is quite small from a statistical standpoint, and would pass nearly all tests of statistical significance, so inference would suggest that our coin is, in fact, probably weighted toward heads. In other words, it’s not enough to know whether data shows a relationship; we must also know how likely it is that the relationship could have resulted from random chance alone.
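The coin-flip arithmetic is easy to check. The sketch below is one way to do it in Python; the function name `p_value_at_least` is my own invention, and it assumes a one-sided test (we only care whether the coin favors heads).

```python
from math import comb

def p_value_at_least(heads, flips, p_heads=0.5):
    """Chance of seeing `heads` or more heads in `flips` tosses of a coin
    that lands heads with probability `p_heads` (a fair coin by default)."""
    return sum(
        comb(flips, k) * p_heads ** k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )

one_flip = p_value_at_least(1, 1)     # one flip, one head
ten_flips = p_value_at_least(10, 10)  # ten flips, ten heads

print(one_flip)             # 0.5
print(round(ten_flips, 5))  # 0.00098
```

Ten heads in ten flips works out to 1-in-1024, which is where the "just under 1-in-1000" figure comes from.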
The most common test of statistical significance involves securing a p-value smaller than 0.05, which corresponds to just a 5% chance that, were our two variables actually unrelated, we would have seen that specific data set purely by chance. In some applications, statisticians might place a higher premium on certainty and use a significance test of p = 0.01, or p = 0.001, or really anything they desire. The biggest key, of course, is determining what an acceptable p-value should be before looking at the data.
But remember what this p-value is really saying for a minute. If I look at twenty sets of data, I’d expect one of them to show a statistically significant relationship just through dumb luck alone. If I took twenty unweighted coins and performed twenty coin-flip trials, I would expect one of them to return a statistically significant “false positive” indicating that that coin was probably weighted. XKCD, as is its wont, has provided us with a humorous illustration of this effect.
I’m not disparaging statistical significance. Obviously it’s an important safeguard to the process of inference. I’m merely raising caution about “shotgun statistics”, whereby someone looks at dozens of variables at once looking for a relationship. If you go looking for a relationship in enough different places, it’s only a matter of time before you find one.
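To put a number on the "shotgun" problem, here is a small back-of-the-envelope sketch. It assumes the twenty tests are independent of one another and that none of the variables is truly related to anything; the variable names are mine.

```python
alpha = 0.05  # the usual significance threshold
tests = 20    # number of unrelated variables we go fishing through

# On average, how many tests clear the bar by luck alone?
expected_false_positives = alpha * tests

# How likely is it that at least one of them does?
chance_of_at_least_one = 1 - (1 - alpha) ** tests

print(expected_false_positives)
print(round(chance_of_at_least_one, 2))
```

Under those assumptions, the chance of turning up at least one "significant" result among twenty dead-end variables is roughly 64%. That's better than a coin flip, for a finding that means nothing.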
3. ... Meaningfully Explanatory...
Another term you’ll hear often if you hang around a statistically-minded crowd is “R^2”, or “R-squared”. Remember that linear regression is a method of taking a random cloud of data points and drawing a line that most closely fits the data. R^2, then, is the measure of how well the line fits the data. R^2, much like p, always ranges between 0 and 1. Unlike p, where we wanted to get as close as possible to 0, with R^2 the goal is to get as close as possible to 1.
Essentially, R^2 is a mathematical representation of how much of the variation in the data is explained by the variables you tested. An R^2 of 1 says “hey, this data forms an absolutely perfect line”. An R^2 of 0 says “this is basically a completely unrelated cloud of data points that looks absolutely nothing like a line”.
R^2 is essentially a measure of how much knowing one value, (such as height), enables us to predict another, (such as production). An R^2 close to 1 tells us “this input variable is essentially the only thing you ever need to know in order to make accurate predictions”. An R^2 close to 0 basically says “knowing this input variable is essentially useless if your goal is to make accurate predictions”.
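For the curious, here is a minimal sketch of how R^2 falls out of a best-fit line: it's one minus the ratio of the variation the line fails to explain to the total variation in the data. The three tiny data sets are invented purely to show the two extremes and something in between, and `r_squared` is just a helper name I chose.

```python
def r_squared(xs, ys):
    """Fit y = a + b*x by least squares and return R^2 for that line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left over after drawing the line vs. total variation.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4]
perfect_line = r_squared(xs, [3, 5, 7, 9])  # points sit exactly on a line
noisy_trend = r_squared(xs, [2, 3, 5, 4])   # upward trend with some scatter
no_trend = r_squared(xs, [3, 5, 5, 3])      # no linear trend at all

print(round(perfect_line, 2), round(noisy_trend, 2), round(no_trend, 2))
```

The perfect line scores 1, the scattered-but-rising data scores 0.64, and the data with no linear trend scores 0.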
Using something called multivariate analysis, it’s possible to measure the impact of multiple different variables at the same time. In this case, each additional variable you add to your model should meaningfully increase the value of the R^2. Using my studying analogy from earlier, let’s say that I track a class of students and want to find the relationship between their time spent studying and their final scores. Let’s say that time spent studying alone produces a statistically significant relationship with an R^2 of 0.36. That’s actually a pretty robust result: it means studying time alone explains 36% of the variation in test scores. Or, in other words, if I knew nothing at all about a student other than how long he spent studying, I could probably make a reasonably informed guess about how well he’d do on the test.
Obviously studying isn’t the only thing that impacts test scores, however. Let’s say, instead, that I used a multivariate analysis that incorporated both time spent studying and student IQ, and this new model resulted in an R^2 of 0.62. In this case, both studying and IQ are meaningfully predictive, and my model is now much improved. Let’s say that I also discover that, by incorporating sodium intake along with IQ and time spent studying, I can increase the value of my R^2 up to 0.63. In this case, adding another variable increased the explanatory power, but the gains were so small that they were not worth the added complexity to my prediction model. Adding extra variables to achieve minimal improvements to R^2 is what’s known as “overfitting” in statistics circles, and should be avoided when possible. In the most extreme case imaginable, if I had a class of 30 students and you allowed me 30 variables to work with, I could produce a model that predicted test scores with an R^2 of 1. Such an unwieldy beast of a model, however, would fall apart entirely once I took it outside the confines of the specific data set I used to train its predictive powers in the first place.
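Here is a scaled-down sketch of that extreme case. Rather than 30 students and 30 different variables, it uses a stand-in that is easier to show in a few lines: five students and a degree-4 polynomial of study hours, which has five free coefficients, one per student. The scores are invented, and `lagrange_predict` is my own helper name for standard Lagrange interpolation.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique polynomial passing through every (xs, ys) point."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up class of five students: hours studied and test score.
hours = [1, 2, 3, 4, 5]
scores = [60, 75, 65, 85, 80]

# On the data it was built from, the model is flawless (an R^2 of 1).
training_predictions = [lagrange_predict(hours, scores, h) for h in hours]

# On a student it has never seen, who studied 6 hours, it falls apart.
new_student = lagrange_predict(hours, scores, 6)

print(training_predictions)
print(new_student)
```

The model reproduces all five training scores exactly, then predicts a score of about -115 for the student who studied six hours. A perfect fit on the data in hand tells you nothing about how the model behaves anywhere else.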
4. ... Practically Significant...
Okay, so let’s say we have a set of data with a linear relationship, a low p-value, and a high R^2. Does that mean it’s useful? Not necessarily, and the distinction gets to the difference between what statisticians mean when they say “significant” and what the average person means when they say “significant”.
Imagine that in NFL history, there were a million receivers who were 6’0”, and they all had exactly 1072 receiving yards every year. Imagine a million more receivers who were 6’1”, and they all had exactly 1073 yards. Imagine a million more 6’2” receivers who always had 1074 yards, and so on and so forth. Because the sample size is so extraordinarily large, we would sail past all possible tests of statistical significance. And since the linear relationship is so strong, we’d produce an R^2 of 1. By all standards, this is a phenomenally robust statistical relationship.
Linear regression, as I’ve mentioned, produces a “line of best fit”. That line follows the formula “y = a + bx”, where “y” is what you’re trying to predict, (in this case, receiving yards), and “x” is what you’re using to predict it, (in this case, receiver height). “A” represents the intercept of the line, which isn’t all that important to us here, and “b” represents the slope of the line. That’s what I want to focus on when I talk about practical significance.
In this absurd hypothetical, the equation for the line of best fit would be y = 1000 + 1x, where “x” is height in inches. In other words, for every additional inch of height, the receiver should be expected to finish the season with 1 more yard. Or, to put it another way, a 6’6” behemoth should be expected to finish the year with a whopping 10 more yards than a 5’8” shrimp.
This finding is significant, (statistically), but is it significant? If I told you that the tallest players in the NFL should finish with about ten more yards than the shortest players in the NFL, how excited would you be about that knowledge? More yards is obviously more better, but I doubt many owners are going to be doing cartwheels over what essentially amounts to a few extra hundredths of a fantasy point every week. To fantasy owners, that difference is “insignificant” in the most common sense of the word.
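Running the numbers on that hypothetical line takes only a few lines of Python. The scoring rule (one fantasy point per 10 receiving yards) and the 16-game season are my assumptions for the sake of the arithmetic; your league may differ.

```python
def predicted_yards(height_inches, intercept=1000, slope=1):
    """The made-up line of best fit from the hypothetical: y = 1000 + 1x."""
    return intercept + slope * height_inches

tall = 6 * 12 + 6   # 6'6" in inches
short = 5 * 12 + 8  # 5'8" in inches

yard_gap = predicted_yards(tall) - predicted_yards(short)

YARDS_PER_POINT = 10   # assumption: standard yardage scoring
GAMES_PER_SEASON = 16  # assumption: 16-game season

points_per_week = yard_gap / YARDS_PER_POINT / GAMES_PER_SEASON
print(yard_gap, points_per_week)
```

A ten-inch height gap buys ten yards a season, or about 0.06 fantasy points per week. The slope is what tells you this; the p-value and R^2 never would.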
5. ... and Not a Result of Lurking Variables.
Okay, so let’s say we are very careful with our statistical inference, and we find a relationship that is linear, statistically significant, meaningfully explanatory, and practically significant. Time for a happy dance, right?
Well… maybe. It’s possible that we’ve stumbled upon a meaningful relationship. It’s also possible that we’ve stumbled across two ultimately unrelated variables that are instead both highly correlated with a third, untested variable. These extra untested variables are known as “lurking variables”, and the best way to illustrate the concept, as usual, is with an example.
Let’s say that I develop a theory that eating dinner too early in the day is actually hazardous to your health. I develop a list of people based on what time they eat dinner, and I plot it against their mortality rate. Let’s say that I find a very robust linear relationship saying the earlier you eat dinner, the more likely you are to die in the coming year. Let’s say this relationship has a very low p-value, a very high R^2, and would be described by the average observer as quite significant. Hypothesis proven, right?
Well… wrong. You see, for a number of reasons, older people are much more likely to eat dinner early than younger people are. Also for a number of reasons, older people are much more likely to die in any given year than younger people are. In this case, eating dinner early does not make you more likely to die soon, but a lurking variable, (being old), makes you both more likely to eat dinner early and to die soon. If you were a perfectly healthy 25-year-old, moving your dinner time from 8 PM to 5 PM would probably have no impact whatsoever on your chances of surviving the year. As the saying goes, "correlation does not necessarily imply causation".
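A toy data set makes the trap easy to see. Every number below is invented: each row pairs a dinner hour (on a 24-hour clock) with a mortality figure, split into a young group and an old group. The helper names are mine.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

def corr_of(rows):
    dinner_hours, deaths = zip(*rows)
    return correlation(dinner_hours, deaths)

# Invented rows: (dinner hour, deaths per 1,000 people)
young = [(19, 1), (20, 1), (21, 1), (19, 3), (20, 3), (21, 3)]
old = [(16, 9), (17, 9), (18, 9), (16, 11), (17, 11), (18, 11)]

pooled = corr_of(young + old)  # everyone lumped together
within_young = corr_of(young)  # young people only
within_old = corr_of(old)      # old people only

print(round(pooled, 2), round(within_young, 2), round(within_old, 2))
```

Lumped together, dinner time and mortality show a strong negative correlation of about -0.85: the later the dinner, the lower the death rate. Within each age group the correlation is exactly zero. Once you account for age, dinner time tells you nothing.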
In short, whenever we find a neat relationship between two variables, we must ask ourselves if there isn’t perhaps some third thing we’re overlooking that actually helps clarify the relationship. With respect to receivers and height, the great lurking variable is “talent”, or perhaps "perceptions of talent". Think back to that chart way up at the top of this article listing receiver height by draft position. Taller players were more likely to be drafted earlier. Players drafted earlier are likely to be more talented, or at least likely to be perceived as more talented. More talented players are more likely to get a bunch of yards and touchdowns, (and players who are perceived as more talented are likely to get extra targets as a result of that perception). We can’t simply look at the relationship between height and production without also accounting for these lurking variables.
I like to say that talent is the great confound for exactly this reason. In any analysis of the relationship between a measurable variable and player production, talent is always the lurking variable hanging out behind the scenes, and it is notoriously difficult to account for.
Bringing It Back to the Question at Hand
I’m sorry for the rather long detour, but I felt this next analysis would be a lot easier with a basic understanding of the concepts at play. This was hardly a comprehensive breakdown of how statistical inference works, but it was enough to help equip us to critically examine claims instead of resigning us to uncritically accept them at face value.
When faced with a statistical claim, we can now ask questions about the p-value, or how likely the relationship would have been to result from chance alone. We can ask about the R^2 and just how much explanatory power the relationship provides. We can ask for the formula of the line of best fit and pay special attention to the slope to see just how much of an impact the relationship really has. We can even think critically about whether we should expect the data to behave linearly, and whether there might not be a lurking variable that helps better clarify the relationship.
Armed with these new skills, let us return to the original question: do NFL front offices place enough of a premium on receiver size? Or do they perhaps undervalue the importance of the trait?
Chase Stuart looked at this question last season, and since he addressed it far more thoroughly and ably than I could, I will merely point you to his analysis.
Chase’s big takeaway was that taller receivers tended to be drafted higher, but when you control for draft position, there was no relationship at all between player height and player production. His p-value was 0.53, far higher than even the most lenient conceivable tests of statistical significance. In an attempt to account for that “talent” lurking variable, he ran a regression of production relative to draft position, and found that adding height as a variable to his model resulted in absolutely no improvement to his R^2.
All of this is pretty strong evidence that front offices were not, in fact, undervaluing receiver height. Or, to borrow his synopsis, “But I see nothing to indicate that short receivers who are highly drafted do any worse than tall receivers who are highly drafted. It’s just that usually, the taller receiver is drafted earlier.”
Now, I mentioned in my first article that “WR size” could mean height, weight, or a combination of the two. Chase’s analysis only looked at height, so it’s possible that weight or “density” would tell a different story. So let’s turn our attention to a different study that claims to find exactly that result.
In response to that analysis by Chase Stuart, Frank DuPont wrote an article last year discussing the statistically significant link between weight and future performance. The article is behind a paywall, but in the interest of open dialogue, Frank has graciously given permission to summarize its findings.
Frank found that draft position was significant at the p = .001 level, which means there was just a 0.1% chance that the relationship between draft position and production would have been that strong purely by chance if the two were unrelated. He found that a model predicting player production using draft position alone had an R^2 of 0.331, which is to say that draft position alone explained 33.1% of the variation in player production.
Frank also found that, used in conjunction with draft position, weight was significant at the 0.05 level, which means there was just a 5% chance that the relationship between weight and production would have seemed that strong if they weren't actually related. While this is a much lower threshold of certainty, as I mentioned earlier this is the most commonly-accepted standard for statistical significance. Crucially, though, his model explaining player performance using both draft position and weight had an R^2 of just .337, only six thousandths better than the model that used draft position alone.
Frank even notes the tiny increase in R^2, but points out that, while small, it was still an increase. The problem, of course, is that adding inputs should pretty much always result in some gains, simply because the model better learns from the data available. It’s a question of whether those gains translate to better predictive performance outside of the given data set.
In this case, the gains in predictive performance were so marginal that the interest in minimizing inputs should have taken precedence over the interest in maximizing the R^2. In other words, adding weight as a consideration above and beyond draft position was, in my opinion, almost certainly a case of overfitting the available data.
The other flaw in the analysis is that Frank treated draft position as a linear variable when, in fact, it is not. Think about those old Jimmy Johnson draft trade charts for a minute. Draft value does not drop by a set amount from pick to pick. Instead, it drops by a lot between picks at the top of the draft, and by very little between picks at the bottom of the draft. By nearly any measure, draft value declines exponentially and not linearly.
Chase accounted for this fact by correlating production against draft value instead of ordinal draft position. And when Chase pointed that out to Frank, to Frank’s great credit, he immediately agreed and re-ran the data with that adjustment.
In Frank’s new analysis, the relationship between weight and future production, after controlling for draft position, had a p-value of 0.168, which is not statistically significant at any commonly-accepted threshold of significance. Undeterred, he re-ran the regression using draft position, height, *AND* weight as inputs. When using the three inputs, suddenly both height and weight registered as statistically significant at the p = 0.05 level, which indicates that it’s neither height nor weight that is important, but the interaction between the two. Proponents of BMI as the tool for measuring receiver size rejoice.
The problem, once again, lies in the R^2. The model using draft position alone to predict player performance had an R^2 of 0.383. The model using draft position, height, and weight had an R^2 of 0.392. That’s a gain of just 0.009, and it resulted from adding not just one but two different variables. Again, to my mind the minimal gains in predictive power are not nearly enough to justify the additional variables, and we once again have a model that is guilty of overfitting.
I’m not singling these articles out to call out Frank. I’m a huge fan of his, both on RotoViz and on Twitter, (where he tweets as @FantasyDouche). And, in fact, I would highlight his analysis here as the perfect example of what people should be doing. He was transparent about his processes. He was responsive to criticism. He was happy to share his results in the interest of furthering the discussion. I may disagree with his interpretation of the data, but I strongly admire his openness, his inquisitiveness, his collaborative spirit, and his willingness to admit inconvenient information in search of the truth. I can disagree with him and still steadfastly believe that the fantasy space needs more analysts like him.
Wrapping It Up…
So what does this mean for the player size debate? Does this mean that weight and height are not important for receiver production? Of course not— as I mentioned earlier, bigger and taller receivers are drafted earlier, on average, than shorter and lighter receivers. They go higher specifically because those traits are valuable. But those traits are already accounted for in draft position.
Adding weight and height to a model in addition to a player’s draft position is a way of double-counting those attributes. That would be appropriate if NFL front offices were undervaluing those traits in the draft. But several different analyses approached from several different angles have found that adding those variables produced no meaningful gains over a “draft position only” approach.
Chase Stuart is a very smart guy who believes the draft is an efficient market, receiver size is best measured by height, and it’s being valued properly. Frank DuPont is a very smart guy who believes the draft is an inefficient market, receiver size is best measured by weight, and it’s being valued improperly.
They ran two different analyses. Chase’s analysis was yardage-heavy, valuing a touchdown as equivalent to 20 extra yards of production. Frank’s was touchdown-heavy, valuing a touchdown as equivalent to 60 extra yards. One penalized receivers for interceptions, while the other did not. And both approaches demonstrated that considering weight or height in addition to a player’s draft position provides no tangible advantage when it comes to predicting player outcomes. At the very least, neither approach provided us with enough evidence to reject our null hypothesis that size is valued properly.
At the end of the day, that’s what it comes down to; what is required for us to reject our null hypothesis? That’s how statistical inference works. We begin with the default assumption that there is no meaningful relationship, and we endeavor to hold on to that default assumption until the data demonstrates that it is untenable to do so.
When you think about it, the idea that the NFL is undervaluing receiver size is far more revolutionary than it first seems. After all, NFL franchises employ scores of experienced, well-qualified, highly-compensated people whose full-time jobs are determining what translates to NFL success. The idea that these institutions have overlooked something that a part-time fantasy analyst with a knowledge of statistical inference has stumbled upon is a very big claim.
This is not to say that it’s an impossible claim. We already have evidence that it is not. Until the mid-2000s, baseball front offices absolutely undervalued on-base percentage’s ability to translate into wins, and every fan with a background in statistical inference knew it. The precedent is certainly there. At the same time, we also saw baseball quickly react to this realization to the point where the inefficiency disappeared practically overnight. And we saw more and more baseball front offices begin incorporating statistical inference into their evaluation process as a result.
The evidence required to prove a position should be in proportion to the size of the claim. When making a big claim like “NFL teams undervalue the importance of size at the receiver position”, one needs clear and compelling evidence to support it. And on this subject, what evidence I’ve seen has been neither clear nor compelling. As a result, I hold fast to the null hypothesis; as a result, I believe that the NFL is doing a fine job at valuing size properly among receiver prospects.
… and Putting a Bow on It
Alright, there’s a lot of information here for you to digest, but if you walk away from this article with just one thing, what should it be? Basically, it’s this: height and weight are important, but no more than any other trait. They are undeserving of the pedestal on which they have been placed in many corners of the analytics community. If a receiver is drafted early despite a lack of either attribute, it is because front offices have decided that his other positive traits easily outweigh any deficiencies in those areas. When the NFL spends the #12 draft pick on Odell Beckham, Jr., we probably shouldn’t fret over the fact that he’s “only” 5’11” and 198 pounds. The NFL already considered the fact that he’s “only” 5’11” and 198 pounds, and the NFL decided that his other strengths were more than enough to offset.
Or, to once again quote Chase Stuart: “I see nothing to indicate that [small] receivers who are highly drafted do any worse than [big] receivers who are highly drafted. It’s just that usually, the taller receiver is drafted earlier.”
This certainly won’t be my last look at the subject of receiver height; next planned is a look at whether, based on the last ten years of evidence, the fantasy market is undervaluing size at the receiver position. This is a different claim, and one that is much easier to believe. What will the evidence say? Stay tuned to find out!
More articles from Adam HarstadSee all