Friday, September 12, 2014

Thoughts on Player Value Part II

In the first post on Player Value, we discussed some of the basic types of baseball-related data along with the strengths and weaknesses of each type.  If I were to summarize that post as succinctly as possible, I would say that primary data is the most reliable, and that the more calculations you have to make to produce a given statistic, the more opportunity for error is introduced.  On the other hand, primary data may tell you less about a player's performance than calculated data, which is why we often prefer calculated data despite the increased opportunity for error.

Offensive statistics have been around since practically the invention of the game, including fairly advanced and calculated measurements.  I recall reading about Slugging Percentage for the first time in the 1961 World Book Encyclopedia Yearbook!  On the other hand, defensive statistics have been extremely limited, probably due to difficulties in measurement.  For decades, defensive measurement was limited to Errors, Chances and Fielding Percentage, measures that I think we would all agree do not begin to quantify the contributions, positive or negative, of a player to defense.  Defense was largely in the eye of the beholder.  The conventional wisdom was that once you won a Gold Glove award for defensive excellence, you could pretty much count on winning it every year for the rest of your career.  You were also more likely to win a Gold Glove if you hit well too!

At the time Moneyball was written, statistics-oriented GMs like Billy Beane had come to totally discount defense in the formula for winning baseball.  Since they had no way to measure it, they preferred to write it off as a parameter to consider, rather than rely on scouting or "the smell test."  Their attitude was, "if you can't measure it, it doesn't exist."

Since then, at least two fairly sophisticated defensive measurement systems have become fairly well known, and now that they have something that can be expressed as a number, the Billy Beanes of the world have suddenly embraced defense as an important contributor to a player's value.  Possibly the best known of these systems is UZR, or Ultimate Zone Rating, as championed by a website known as Fangraphs.

I do not pretend to completely understand UZR.  I will try to summarize it here, but I highly suggest visiting the Fangraphs site linked to the left, clicking on the Glossary tab at the top of the page, and then looking for an article called UZR Primer.  That is the most complete explanation of the system that I know of, and it's from the site that champions it.

Basically, UZR is derived from knowledgeable observers charting each and every ball in play in MLB.  Judgements are made as to the type of hit, its location, and the degree of difficulty in fielding it.  The frequency of plays made is then compared to historical norms based on multiple years of data.  Throws are also charted, as is success in limiting the progress of baserunners.  All this data is then quantified into a number which tells you how many outs above or below average a fielder achieves, and this is then translated into a form of Win Shares to be added into an overall WAR score.

I think we would all agree that UZR tells us a lot more about a player's defense than the old errors, chances and fielding percentage.  As we learned in Part I of this series, though, the more variables and the more calculations involved in producing a statistic, the higher the probability of introducing error into the process.  The big question about UZR is not whether it is measuring the right components of defense, but whether it is doing so accurately.  Fairly wide swings in UZR from year to year continue to be problematic for the statistic.

Forget for a minute that the entire process is based on the subjective opinion of an observer.  The system also has significant structural problems in the calculations themselves.  One big problem with UZR is what I will call Effective Sample Size.  To illustrate this problem, let's look at Hunter Pence, who grades out as a roughly average defensive RF, who plays the position exclusively and who plays every game, so his sample size should be as large as you can get for the position.  UZR has 6 degrees of difficulty in its ratings.  I list them here with the expected percentage of successful plays in parentheses for each:  Impossible(0%), Remote(1-10%), Unlikely(10-40%), Even(40-60%), Likely(60-90%), Routine(90-100%).

Now, I will list Hunter Pence's percentage success rate for each category this year with the number of chances for each in parentheses:  Impossible-0%(56), Remote- 9.1%(11), Unlikely-22.2%(9), Even- 57.1%(7), Likely-100%(15), Routine- 99.2%(264!).

As you can see, the range of plays sorted by degree of difficulty is a reverse bell curve skewed heavily to the routine.  The vast majority of plays are going to be made close to 100% of the time or 0% of the time.  The types of plays that would be expected to separate a good fielder from a bad fielder (Remote, Unlikely, Even and Likely) combine to make up a remarkably small sample size, which I would call the Effective Sample Size.  As few as 1 or 2 plays made or not made in the Remote category can make a large difference in your final number.  Differences in UZR are derived from remarkably small Effective Sample Sizes!
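To put rough numbers on that, here is a minimal Python sketch using only the Pence chance counts listed above.  The function name and the choice of which buckets count as "discriminating" are illustrative assumptions, not part of UZR itself, and the "1 of 11" Remote figure is inferred from the 9.1% success rate quoted above.

```python
# Chances per difficulty bucket, copied from the Hunter Pence figures above.
chances = {
    "Impossible": 56,
    "Remote": 11,
    "Unlikely": 9,
    "Even": 7,
    "Likely": 15,
    "Routine": 264,
}

# Buckets where a good and a bad fielder could plausibly differ.
# (Assumption for illustration: this is the post's "Effective Sample Size".)
DISCRIMINATING = ("Remote", "Unlikely", "Even", "Likely")


def effective_sample_size(chances_by_bucket):
    """Sum the chances in the buckets that can separate fielders."""
    return sum(chances_by_bucket[bucket] for bucket in DISCRIMINATING)


total_chances = sum(chances.values())
effective = effective_sample_size(chances)
effective_share = effective / total_chances

# One extra Remote play made: 1 of 11 becomes 2 of 11.
remote_rate_before = 1 / 11
remote_rate_after = 2 / 11

print(total_chances, effective, round(100 * effective_share, 1))
# prints: 362 42 11.6
print(round(100 * remote_rate_before, 1), round(100 * remote_rate_after, 1))
# prints: 9.1 18.2
```

So only 42 of 362 chances, under 12%, fall in the buckets that can tell fielders apart, and a single additional Remote play doubles the success rate in that bucket.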

This explains, at least in part, why we see such large variations in UZR from year to year.  Add in the impact of injuries and the normal trajectory of aging and it can be almost impossible to factor out whether yearly swings in UZR are due to chance, injury, age, or skills.  It also calls into question whether even 3 years of data gives you an accurate picture of a player whose physical ability may have eroded over that period of time.

Still, I would much rather know a player's UZR than just his errors, chances and fielding percentage, simply because it includes information about important factors in defense, even if that information is flawed by the methodology.  On the other hand, I maintain that there is still a significant role for subjective observation, or the "Smell Test," in evaluating the defensive skills of baseball players due to flaws in the methodology of deriving UZR and other defensive metrics.


  1. DrB is surely right, I think, that fielding evaluation has lots of room for subjective error. A book I would recommend to anyone interested is *Beyond Batting Average: Baseball statistics for the 21st century* by Lee Panas (2010); it suggests that one look at the results of several systems to compensate for the subjectivity of each. I would add that although I agree with DrB that using one's eyes to complement the defensive stat numbers, those eyes had better be pretty expert or sufficiently plural (many observations by many people) to be worth much. Tom Tango polls fans about their visual judgments as to fielding, looking for a large sample for increased validity here just as one needs a large sample of ABs or chances in the field to move toward valid stats.

    1. "That using" above should read "in favor of using."

      If one wants to see how erratic the current fielding evaluations are, one can look at Gregor Blanco, whose defensive WAR over the last three seasons varies wildly on Fangraphs and Baseball Reference, and differs from the reputation he has as an excellent fielder from the many eyewitnesses to his run-saving catches.

  2. I think most defensive metrics have a lot to do with the speed and agility of the player. Guys like Pence and Blanco are able to get to more balls because of their quickness, and it is probably harder to make the play if they are running full speed at a ball, as opposed to someone who is lightly running and has a good read on it. I heard a stat around 2010 that we had one of the best defensive outfields in baseball, and that primarily had to do with guys like Burrell, Huff, Guillen and DeRosa not being able to get to balls because of their lack of speed. If they can't get to the balls then they aren't committing the errors.

    Players like Mike Trout and Hunter Pence are premium defenders and have a negative dWAR. When you are going full speed like a bat out of hell you tend to make mistakes. This goes to guys in the infield like Brandon Crawford. He gets to balls other fielders just can't and when he has to rush the throws sometimes he gets offline. Now some critics would say just hold the ball but I like the fact he tries really hard to make every play. He makes some incredible plays that barely anybody can.

    Just my take on it. Thoughts?

    1. It seems that defensive metrics do tend to favor players with greater range and tend to punish errors relatively less than what we as fans are used to doing. What I do not know is how different types of plays are weighted. Do you get extra credit if you somehow make a play on a ball that falls into the "impossible" category? Do you get penalized more if you get fumble fingers on a ball from the "routine" category? I haven't been able to scare up that info yet.

    2. The Panas book I mentioned in my previous post explains the various systems. But the short answer is yes, a player gets credited more and penalized less if the play is hard. If on the average a given play is made only 25% of the time, the player gets .75 added to his score if he makes it, and so forth. The total score for all his plays can be compared with a norm based on all plays made on all opportunities to make plays at that field position. The overage leads to a plus rating in the system, and a player who accumulates minuses for not making plays he should will end up with a negative rating in the system. Routine plays normally get made, so not making one gets a heavier penalty than not making an unlikely play.
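The crediting rule described in the reply above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the +0.75 for making a 25% play comes from the comment, while the matching penalty for a miss (minus the league success rate) is inferred from the "and so forth" and from the remark that missed routine plays cost more.  The function names and the sample plays are hypothetical.

```python
def play_credit(league_rate, made):
    """Credit for one play, given the league-wide success rate for that
    kind of play: (1 - rate) if the play is made, -rate if it is missed.
    The miss penalty is an assumption inferred from the comment above."""
    return (1.0 - league_rate) if made else -league_rate


def total_plays_above_average(plays):
    """Sum credits over (league_rate, made) pairs for one fielder."""
    return sum(play_credit(rate, made) for rate, made in plays)


# Hypothetical sample: two unlikely plays (25%) and two routine plays (95%).
sample_plays = [
    (0.25, True),   # hard play made:      +0.75
    (0.25, False),  # hard play missed:    -0.25
    (0.95, True),   # routine play made:   +0.05
    (0.95, False),  # routine play missed: -0.95
]

print(round(play_credit(0.25, True), 2))                    # prints: 0.75
print(round(total_plays_above_average(sample_plays), 2))    # prints: -0.4
```

Under this rule a fielder who makes exactly the league-average share of plays in every bucket nets out to zero, which is why a single botched routine play weighs so heavily.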

  3. What type of calculation, if any, is made for positioning? Before my time, but I heard that Mays played a shallow CF that took hits away. His speed and ability to get jumps on a ball enabled him to race back and get deeply hit balls …..

    What type of calculation is made for "smart" play? We all know the famous Jeter flip to nail Giambi at the plate. Some players have a knack for making good, instinctual plays.

    My hunch is that these factors are not being properly factored in a metric ...

    1. Positioning isn't factored in, on the grounds that every play is the result of combining positioning, speed, sureness with the glove, and such factors which can't easily be disassembled. See Panas 108.

    2. I believe Campanari is correct here.