Friday, October 12, 2012

Forecasting player performance: goals

WARNING: This post is going to skew long and won't contain any directly relevant fantasy advice to help you make your transfers this week. It might get a bit nerdy at times, though I'm far from classically trained in statistics so the concepts, if not always the terminology, should be accessible to all. I also won't reach any definitive conclusion as these models are a work in progress. Input on anything not considered below, or on any conclusion you disagree with, can be posted in the comments or over at Shots on Target, where you'll find a handy forum to discuss these issues. We're also only looking at forecasting goals here; assists will need a separate post.

Tracking historic success
The way I look at it, there are three distinct ways to track historic success:
  • Historic classic stats (goals, assists etc)
  • Historic fantasy points (similar to classic stats but accounting for all fantasy relevant events)
  • Historic underlying stats (a much deeper understanding of a player's performance including his involvement in different areas of the field, his shots taken, where those shots were taken from, his passes completed etc).
If you've made it this far you're probably comfortable with the fact that the last of these is the most useful, but you'd be amazed by how many comments I see claiming that player x is in 'form' or that player y is likely to score, essentially based on whether or not he scored in one or two recent games.

I must admit that I struggle a bit here as it seems odd to totally ignore production to date when we still aren't totally certain about the relationships we'll explore below. If you were forecasting the chance of a die landing on '4' then, of course, historic data is totally useless, the odds are still 1/6. However, if you were unsure whether or not the die was loaded you might want to adjust your 1/6 estimate at some point once the data sample became significant. We don't have 'loaded' fantasy players but we do have outliers who have consistently shown an ability to out (or under) perform their underlying stats and thus taking some account of the historic production could act as a safety net to make sure we don't judge these players incorrectly. It's not an ideal solution and I'm still not sure if it's required but I do think this data at least deserves to be addressed rather than being simply discounted as unreliable.

Projecting future success
So what do we want to know about an individual player to help us forecast his future success? A few considerations:

1. How many, and what kind of, scoring opportunities is he getting each game?
Through three games this year, Michu had registered 8 shots, 4 of which were on target and all of which hit the back of the net. We will often comment that such a conversion rate is 'unsustainable', but what exactly does this mean? Well, the fact that Michu has hit the target 50% of the time looks about right and shouldn't be of concern. Last season midfielders hit the target around 44% of the time and it's reasonable to suggest that Michu is at, or above, league average. We'll get into individual adjustments below in point 2, but for now, we can conclude that this rate is roughly acceptable. The issue however is Michu's 4 goals from just 4 shots on target. Last season, midfielders converted shots on target to goals at a rate of 25% so we would have expected Michu to have just a single goal, not the four he has registered at this point. We would therefore conclude that, if Michu continues to get chances at his current rate, he should regress to the mean in the coming weeks and won't perform at the same rate as he has to date. Note that we are not saying everything will equal out so that over the season he will necessarily have converted 25% of his shots on target into goals, only that that is the expected outcome from here on.
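A minimal Python sketch of the Michu arithmetic above. The function and variable names are illustrative only and are not taken from the model itself; the 25% figure is the league rate for midfielders quoted in the text.

```python
# League-average rate at which midfielders converted shots on target
# into goals last season, as quoted in the text.
LEAGUE_SOT_CONVERSION_MID = 0.25


def expected_goals_from_sot(shots_on_target, conversion_rate):
    """Goals we would expect from a given number of shots on target."""
    return shots_on_target * conversion_rate


# Michu through three games: 8 shots, 4 on target, 4 goals.
michu_shots, michu_sot, michu_goals = 8, 4, 4

on_target_rate = michu_sot / michu_shots
expected = expected_goals_from_sot(michu_sot, LEAGUE_SOT_CONVERSION_MID)
surplus = michu_goals - expected

print(f"on-target rate: {on_target_rate:.0%}")
print(f"expected goals: {expected:.1f}")
print(f"goals above expectation: {surplus:.1f}")
```

The surplus is the part we would not expect to repeat: the forecast from here on uses the 25% rate, rather than assuming the extra goals will be 'paid back'.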

Now, the next issue to consider is what kind of shots a player is getting. This is intuitive as shots in the box will obviously be converted at a higher rate than those from long range, but this point really needs to be emphasised when you consider the differences. The below table shows the percentage of different shot types converted to goals last season:


                                 Midfielders   Forwards
Total shots - inside box             18%          20%
Total shots - outside box             5%           7%
Shots on target - inside box         35%          39%
Shots on target - outside box        14%          17%
Table 1 - Conversion rates of shot types by position.

We can see that the differences are dramatic and thus we need to be careful when looking at total shots for players like Cazorla, who are prone to take a pop from well outside the area. Of course, he's very capable of hitting the back of the net from 30 yards, but even the most optimistic of Cazorla fans would have to concede that Fellaini's 30 total shots are quite a lot stronger than Cazorla's 29, when you factor in that 25 of Fellaini's were taken inside the area compared to just 10 for the Spaniard. Indeed, using the averages above, and ignoring shots on target for a second, Cazorla would be expected to have scored 2.75 goals (10 shots inside the box*18% + 19 shots outside the box*5%), compared to 4.75 for Fellaini (25*18% + 5*5%).
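The Cazorla and Fellaini comparison can be sketched in Python as below. The rates come straight from Table 1; the dictionary layout and function name are just one hypothetical way to organise them.

```python
# Share of total shots converted to goals last season (Table 1).
CONVERSION_TOTAL_SHOTS = {
    "midfielder": {"inside": 0.18, "outside": 0.05},
    "forward": {"inside": 0.20, "outside": 0.07},
}


def expected_goals(position, shots_inside, shots_outside):
    """Expected goals from total shots, split by shot location."""
    rates = CONVERSION_TOTAL_SHOTS[position]
    return shots_inside * rates["inside"] + shots_outside * rates["outside"]


# Both players use the midfielder rates, as in the text.
cazorla = expected_goals("midfielder", shots_inside=10, shots_outside=19)
fellaini = expected_goals("midfielder", shots_inside=25, shots_outside=5)

print(f"Cazorla: {cazorla:.2f}  Fellaini: {fellaini:.2f}")
```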

One potential solution to the above dilemma is to look purely at shots on target, which have the strongest correlation to goals over the course of a season. The correlations between different player stats and goals last season are shown below:

Player Stat                     Correlation
Shots on target                     91%
'Big chances' (per Opta)            90%
Shots inside the box                87%
Total shots                         86%
Touches in opponents' box           78%
Table 2 - Correlation between different player stats and goals.

Long term I think that exclusively looking at 'shots on target' could be the right answer, but I believe a small adjustment is needed when dealing with small sample sizes. Consider, for example, Papiss Cisse through seven weeks this season. He's registered a very useful 16 shots (12th among forwards), but has managed to hit the target just 4 times (t28th), not scoring in the process. Looking purely at shots on target would suggest that he 'should' have scored somewhere between one and two goals, depending on how clinical you believe he can really be (league average is somewhere around 34% but last season Cisse scored with 57% of all shots on target). The issue is that last season he hit the target with 54% of his shots, and did so with 46% of his shots in the Bundesliga with Freiburg. Therefore, it's likely that his 25% hit-the-target-rate should also improve, perhaps to as high as 50%, which would give him a projected eight shots on target for the year and thus an expected goal haul of between two and four for the year to date. Either way the data suggests he is due for some positive regression, the way we split it just dictates how much.

I would understand if others were keen to just look at shots on target but given the above, so long as we're dealing with small sample sizes, I plan to add a thin layer to the projection model to account for total shots, taking note however to adjust at the lower of a player's hit-the-target rate and the league average (this should hopefully take care of the likes of Suarez, who's never seen a shot he wouldn't take and historically has a poor on-target rate of 36% while at Liverpool).
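A rough sketch of that 'lower of the two rates' rule, under stated assumptions. The league on-target rate for forwards isn't given in this post, so the 44% used here is the midfielder figure quoted earlier, borrowed purely as a stand-in. The 20 shots for the low-accuracy shooter is a made-up volume for illustration; only the 36% rate comes from the text.

```python
def projected_shots_on_target(total_shots, player_rate, league_rate):
    """Project shots on target from total shots.

    Uses the lower of the player's own historic on-target rate and the
    league average, so a high-volume, low-accuracy shooter is not
    flattered by the league figure.
    """
    return total_shots * min(player_rate, league_rate)


# ASSUMPTION: stand-in league rate (the midfielder figure from the text).
LEAGUE_ON_TARGET_RATE = 0.44

# Cisse: 16 shots so far, 54% on target last season -> capped at league rate.
cisse_sot = projected_shots_on_target(16, 0.54, LEAGUE_ON_TARGET_RATE)

# A 36% shooter (Suarez's Liverpool rate) with a HYPOTHETICAL 20 shots
# keeps his own, lower rate.
low_accuracy_sot = projected_shots_on_target(20, 0.36, LEAGUE_ON_TARGET_RATE)

print(f"{cisse_sot:.2f} {low_accuracy_sot:.2f}")
```

The cap only ever pulls a projection down towards the league figure, which is the conservative behaviour wanted while samples are small.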

2. How has he converted these chances in the past?
Let's go back to Cisse for a second. He has 16 shots with 4 on target but has yet to register a goal. We've acknowledged that the outcome likely doesn't match the process if we took his data over a larger sample size, but how can we adjust it? In short we have two options:
  1. adjust player data based on league average conversion rates
  2. adjust player data based on their own individual historic rates
Ideally we'd use the latter for everyone as it's simply not realistic to assume all players are the same, particularly when it comes to actually hitting the target (what happens after you hit the target seems to be more reliant on luck as even the elite players tend to have peaks and troughs, but even so, skill is clearly a factor). The problem though is sample size, or more precisely, useful sample size. Going back to Cisse, how do we deal with his shot data from Freiburg (a mid table team in a good league) or Metz (at the time in the French second tier)? If we simply discount that data then we're left with 13 league games from last season, in which Cisse posted a historically good (and almost certainly unsustainable) conversion rate.

For better or worse, I think for players like Cisse we're left with no choice but to simply use a league average rate. We could consider having different rates for players with varying profiles, but then you get into a potential mess where we're applying judgements on whether Cisse (recently deployed out wide) is a wide forward or a true 'striker' and thus the whole system could get clouded.

To continue using Cisse as the subject, we'd get the below 'expected' goals:
  1. League average rate (table 1): 14 shots inside the box*20% + 2 shots outside the box*7% = 2.9 goals
  2. Cisse's individual on-target rate: 16 total shots*50% on target rate*39% (table 1) = 3.1 goals
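The two options can be evaluated side by side with a short Python sketch. The constant names are illustrative; the rates are the forward figures from Table 1 and the 50% on-target rate is the assumed individual rate discussed earlier. Evaluated exactly as written, the first expression comes to about 2.9 and the second to about 3.1.

```python
# Forward conversion rates from Table 1.
FWD_TOTAL_INSIDE = 0.20   # total shots inside the box -> goals
FWD_TOTAL_OUTSIDE = 0.07  # total shots outside the box -> goals
FWD_SOT_INSIDE = 0.39     # shots on target inside the box -> goals

# Cisse through seven weeks: 14 shots inside the box, 2 outside.
shots_inside, shots_outside = 14, 2
total_shots = shots_inside + shots_outside

# Option 1: league-average conversion of total shots, split by location.
league_estimate = (
    shots_inside * FWD_TOTAL_INSIDE + shots_outside * FWD_TOTAL_OUTSIDE
)

# Option 2: assumed 50% individual on-target rate, then the league
# conversion rate for shots on target inside the box.
individual_estimate = total_shots * 0.50 * FWD_SOT_INSIDE

print(f"league: {league_estimate:.2f}  individual: {individual_estimate:.2f}")
```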
    What then, for players like Fernando Torres, for whom we have a reasonable amount of Premier League data? Here I believe we need to use judgement but, for starters, we can take his 28(18) appearances in a Chelsea shirt in the league and in this case add in his time at Liverpool too. If we were trying to forecast every player we'd need to set a fixed set of parameters here, but in reality we are probably comfortable using league average rates for the vast majority of players and then individually deciding historic rates for those players under captain consideration (i.e. the elite). To finish the example, Torres recorded above average rates while at Liverpool, hitting the target with 47% of his shots, scoring on 19% of all shots and 41% of those on target. At Chelsea, as expected, his numbers have decreased so that he hits the target on just 28% of shots, scoring on 8% of all shots and 28% of shots on target. Using the sum of all these gives you rates of 43%, 17% and 39%, which are at, or just above, league average. This feels about right given the way Torres has displayed elite skills in the past (and he's just 28, remember) but has struggled in a Chelsea shirt for the most part. I would therefore be happy going with these individual marks to assess Torres' outlook.

    The observant reader will note that, even when looking at Cisse's own individual on-target rate, I have still used the league average conversion rate to see how many goals he ultimately is forecast to score. I've settled on that approach because, in my research to date which I will repost soon, I've generally found the amount of control players have over that rate is fairly low. See also some great work here from James Grayson (h/t to 11tegen11 for the tip).

    One of the landmark pieces of research in baseball asserted a similar fact about what happened after the ball left the bat: balls tended to land fair or be out at a fairly random rate for an individual player, but at an approximate constant for the league. Many didn't believe the data, and many still don't, but it's been shown that from one year to the next players can rank very highly and then very low in terms of getting the ball to land in the field of play, and I believe a fuller investigation into shots on target will show a similar result (before any baseball fans jump in here, I understand BABIP is more complicated than that, but for simplicity's sake, I think that's a fair summary).

    Now, kicking a ball is obviously not the same as hitting a ball, but there are stark similarities between the two events. Firstly, many shots take place with very little thinking time, especially those played into the box. The skill to get these on target is undoubted, but the ability to 'place' them in the corner? Less convincing. Secondly, the positioning of the defense, and particularly the goalkeeper, is outside of an attacking player's control. This can be in the form of a great save in the top corner, but also hard shots ricocheting off defenders' knees or poor headers looping over a diving keeper. Given that we're often only talking about ~100-140 shots and 10-15 goals in a season, these few anomalous and 'lucky' events can have a huge bearing on the outcome.

    Until I see reason to change it, I will therefore use a league average conversion rate of shots on target into goals, splitting chances between those inside and outside the box.

    Now we've established what a player has done and what he should have done to date on an individual basis, let's turn our attention to what his data means to his team and how this translates to future success.

    3. Who has he faced?
    In the past I have accounted for this simply based on goals scored/conceded but given the reliance on shot data for individuals, it seems like that is the best path to take for teams too.

    The question, yet again, becomes whether we should look at total shots, shots inside the box or shots on target. The answer really lies in a chicken and egg like discussion on what dictates the type of shots a team will take during a game more: an attacking team's desire to take shots inside the box or the defensive team's ability to force long range efforts? That needs a whole other case study, so for now I'm going to crudely assume it's somewhere in the middle. We can generate an expectation of total shots, shots inside the box (and hence outside) as well as shots on target by looking at what, on average, a player/team has done against each opponent compared to the league average. For example, let's assume Southampton are playing West Ham at home this week. The calculation would look something like:


    Note: those opponent averages are as of GW7 and not backdated to when the game took place. Given the risk of small samples, I'm okay with this.

    So, to date, Southampton are underperforming the league average by 8% in total shots (3% in, 13% out). This means that when forecasting games, we would reduce the average shots surrendered by their opponents by 3% for those inside the box and 13% for those outside. With the inside-the-box numbers being so low, we can essentially conclude that Southampton are holding opponents to their average level, at least at home.

    We also need to think about how a team's success impacts an individual player. Previously I have somewhat crudely looked at the percentage of goals a player has 'accounted' for and then used a team's weekly forecast to estimate a player's own success. Instead of goals we can look at shots, but then it starts to get a touch tricky. Let's look at an example (through 7 weeks this season):

                         Home    Away
    TOTAL SHOTS
      Lambert              13       3
      Southampton          61      31
      Lambert %           21%     10%
    INSIDE BOX
      Lambert               8       3
      Southampton          35      17
      Lambert %           23%     18%
    ON TARGET
      Lambert               5       2
      Southampton          20      10
      Lambert %           25%     20%
    Table 4 - Percentage of shot types taken by Rickie Lambert to date.
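The percentages in Table 4 are simply Lambert's counts divided by Southampton's. A small Python sketch (the data layout and helper name are illustrative) reproduces them and also derives the outside-the-box share, which the table does not list directly, as total shots minus shots inside the box.

```python
# Shot counts through 7 weeks as (home, away), copied from Table 4.
SHOTS = {
    "total": {"Lambert": (13, 3), "Southampton": (61, 31)},
    "inside_box": {"Lambert": (8, 3), "Southampton": (35, 17)},
    "on_target": {"Lambert": (5, 2), "Southampton": (20, 10)},
}


def player_share(player_counts, team_counts):
    """Return the player's (home, away) share of his team's shots."""
    return tuple(p / t for p, t in zip(player_counts, team_counts))


shares = {
    shot_type: player_share(rows["Lambert"], rows["Southampton"])
    for shot_type, rows in SHOTS.items()
}

# Outside-the-box shots = total shots minus shots inside the box.
lambert_outside = tuple(
    t - i
    for t, i in zip(SHOTS["total"]["Lambert"], SHOTS["inside_box"]["Lambert"])
)
team_outside = tuple(
    t - i
    for t, i in zip(SHOTS["total"]["Southampton"], SHOTS["inside_box"]["Southampton"])
)
shares["outside_box"] = player_share(lambert_outside, team_outside)

for shot_type, (home, away) in shares.items():
    print(f"{shot_type:<12} home {home:.0%}   away {away:.0%}")
```

The away figures rest on just three Lambert shots, which is a reminder of how thin the home/away splits still are at this stage.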

    What to do with this data is a dilemma. Should we use all three averages? Just look at shots on target? Create some sort of average? Based on Lambert alone we clearly need to differentiate between home and away data, but after that it's less clear. I'd be open to suggestions here but, for ease if nothing else, my plan is to look at the percentage of a team's shots inside/outside the box that a player takes and then use his own individual on-target rate (or, where unavailable, the league average) to determine how many will hit the target. We then apply this to who the individual player faces this week, or beyond . . .

    The actual forecast
    We can summarise the above points with an example for how we might forecast a player's totals for the upcoming week. Let's stick with Lambert as we have his data to hand, and we'll assume he's facing West Ham at home.

    First, we work out how we think Southampton will fare in the game. To date, away from home, West Ham are surrendering 11 shots inside the box per game and 5.7 outside. Our adjustments from above (table 3) suggest that these totals should be slightly revised downwards, giving us forecast totals of 10.7 shots inside the box and 4.9 outside.

    Of these, Lambert is forecast to have 23% of those shots inside the box, so 2.45, and 19% of those outside the box, so 0.9 (table 4).

    At this stage, we could get involved in Lambert's individual conversion rates, but given that he's spent a good portion of his career to date knocking around the Second Division, he's going to get the league average rate. That means (from table 1), we're giving him 2.45 shots inside the box*20% + 0.9 outside the box*7% for a total of 0.6 forecast goals, which by the way is an excellent number (after all that would equate to 23 goals over a 38 game season).
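Putting the steps together, the Lambert forecast can be sketched as a single Python function. This is only an illustration of the arithmetic described here, with hypothetical parameter names, not the finished model. The unrounded result is a little under 0.6 because the text rounds at each intermediate step.

```python
def forecast_goals(
    conceded_inside, conceded_outside,          # opponent shots conceded per game
    team_adj_inside, team_adj_outside,          # team vs league average, e.g. -0.03
    player_share_inside, player_share_outside,  # player's share of team shots
    conv_inside, conv_outside,                  # league conversion of total shots
):
    """Forecast a player's goals for one game from shot data."""
    # Step 1: how many shots we expect the team to get.
    team_inside = conceded_inside * (1 + team_adj_inside)
    team_outside = conceded_outside * (1 + team_adj_outside)
    # Step 2: how many of those we expect the player to take.
    player_inside = team_inside * player_share_inside
    player_outside = team_outside * player_share_outside
    # Step 3: convert shots to goals at the league average rate.
    return player_inside * conv_inside + player_outside * conv_outside


# Lambert at home to West Ham, using the figures quoted in the text.
lambert = forecast_goals(
    conceded_inside=11.0, conceded_outside=5.7,
    team_adj_inside=-0.03, team_adj_outside=-0.13,
    player_share_inside=0.23, player_share_outside=0.19,
    conv_inside=0.20, conv_outside=0.07,
)

print(f"forecast goals: {lambert:.2f}")
print(f"over 38 games:  {lambert * 38:.1f}")
```

Keeping the three steps separate makes it easy to swap in a different opponent adjustment, or an individual conversion rate for an elite player, without touching the rest.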

    So that's how the new player forecast data will work for goals, with a similar approach for assists which I will write up shortly. I realise that this isn't rocket science and probably isn't doing much more than a lot of you already do on your own, but I thought it was important to set out the starting point for a new model that will hopefully continue to develop over the season.

    On that, I now pass it over to you for a while. How can the model be improved? What extra factors should we include? Should any of the above process be changed? Be gentle, and I look forward to reading your suggestions. Next I'll look at assists and then move on to tweaking some of the ratios we're going to use, such as the league averages, player historic rates etc. Oh, and congrats for getting through all that if you made it this far.

    9 comments:

    John Doe, 2008 said...

    Brilliant post Chris. Articulates many of the points I have been trying to make here and on Shots on Target much more eloquently than I could ever hope. Really does a fantastic job of laying out a framework for a predictive model while highlighting the questions that need to be answered.

    Your reference to BABIP is interesting. I became interested in sabermetrics about 20 years ago, primarily through the works of Bill James. The BABIP research was completely ground breaking and remains one of the core successes of the sabermetric movement. While your summary is fairly accurate, I will point out that one of the key understandings of BABIP is that team defense most certainly does impact BABIP rates. This is logical as once the ball is put in play, the primary impediment to a base hit is the defense's ability to make the play.

    I point that out because I think goalkeeping may be of a similar influence in football. While a defense may have a major impact on the ability of an opponent to get a shot on target, once that shot IS on target, the only thing stopping a goal is the keeper. Could it be that goalkeeping is a major factor in opponents' shot conversion? It certainly could use some further research.

    My second suggestion would be that the model continue to look at the underlying drivers for key activities (shots, shots on target, key passes, etc.). While this could be version 2(or 3, or 7...), if we could uncover the keys to what drives these activities we could maybe overcome some of our sample size constraints, especially early in a player's season/career.

    Another question I have is when can we make accurate assessments of individual team home/away splits? We'd need to analyze past seasons to see how quickly home/away trends surfaced and if those trends remained throughout the season. It may be that a simple league average might be better than trying to use a small sample size for an individual team.

    I have more questions, but figure I should stop before my response drones on longer than your post! Again, excellent work.

    Chris Glover said...

    Great point John - I meant to mention team defense re BABIP but totally forgot. As a Rays fan I have seen first hand that a lot of the stats which regress pitchers to a league average often don't work for the likes of Jeremy Hellickson, at least partly because of Tampa's very good and often irregular defensive shifts.

    This is likely very true of football too, as someone with obviously elite shot-stopping skills like Al Habsi could well have a significant impact on goals conceded per shot on target. I will look at this data soon to see what we can dig up and see if the difference between facing City or Reading is significant.

    Also a good point on home/away splits. I hesitated to include them because the sample size will never get very large, but then the first player I looked at (Lambert) and the next few all had fairly significant differences so I thought I better include it. When I actually write the formulae in the model I might weight the data to include, for example 50% weight of all shots and 50% of home/road splits, to make sure we're not putting too much emphasis on 4 games worth of away data.

    Thanks for the ideas - much appreciated

    John Doe, 2008 said...

    A couple of other ideas that I am (or am planning on) working on:

    1. Individual influence on goal and assist rates. How much impact do individuals have on their own assist and scoring rates? It seems there is influence (Suarez being a prime example) but we need to uncover how much is influenced by player capability and how much is influenced by other factors.

    Going back to BABIP, it is interesting that hitters DO have quite a bit of influence on BABIP. Primary factors driving BABIP for a hitter include speed, handedness, and mix of batted ball types (ground balls, fly balls, line drives, etc.). I would be surprised if we didn't see something similar in football.

    2. Influence of teammates on assist rates. If individuals can impact goal rates, then, logically, teammates should influence potential assist rates. Again, the question would be how much is individual skill and how much is essentially random.

    3. Defensive skill and the impact of individual defenders. What is the impact of defenses on goal rates? One would assume quality teams keep goals down by limiting shots on target. But how is this done? Is it done by limiting total shots, by allowing shots in areas unlikely to be accurate, or by a combination of the two? Further, how much of this is consistent and how much is essentially luck?

    Further, how does an individual impact team rates? Do different groups of personnel result in different defensive success?

    It would be much appreciated if you would kindly create and publish some workable models for these questions. I will check back in next week to gauge your progress. :)

    chemikills said...

    Great post, it's made me rethink my index

    Rather than basing it on last year's goals per shot on target which, as you say, a player doesn't have very strong control over

    So instead of SoT*(G/SoT)
    I'll do S*(SoT/S)*0.29

    Where G/SoT and SoT/S are from last season's data, which tend to correlate well across seasons (usually with slight improvements)

    I've been rethinking my assist weighting as well (trying to improve it) so looking forward to that next article
    Great work :)

    AnonCargoCult said...

    I've not yet developed a model quite like you're describing, but am clearly heading that way.

    I think for fantasy players, we have to keep things relatively simple, but I also can't help but wonder whether we could factor some of the other stats into this.

    I'd start initially with position. How do these numbers change when playing on the left, rather than the right, for example?

    I'd then look to the inverse. What side are shots being conceded?

    We'd ideally find then that some teams concede more from the side our captain candidate has more success.

    The next step might then be in the buildup. Are certain players seeing success from long balls, while some teams seem weak in this area?

    With absolutely NO background in any of this, it seems to me the biggest issue encountered in football statistics will be the problem of sample size. Even if a player has been in the league for a while, I’m guessing there’s often going to be a significant change in their stats from one team to another.

    Unlike baseball (where I assume a player's stats can remain relatively static, even as he moves from team to team), the team-based nature of football seems to suggest that this isn’t likely in football (as the Torres example seems to bear out).

    I wonder if some day the true value from "footy sabermetrics" might not be dependent on some type of player classification, so that larger sample sizes can be employed.

    If strong enough correlations could be found relating groups of players (playing styles?) to success, then the problem might eventually come down to the proper classification of players.

    John Doe, 2008 said...

    "Unlike baseball (where I assume a player's stats can remain relatively static, even as he moves from team to team), the team-based nature of football seems to suggest that this isn’t likely in football (as the Torres example seems to bear out)."

    You are correct in that the batter/pitcher duel in baseball is essentially a man against man competition. Once the ball is hit, the team comes into play, but baseball's sabermetric value is mostly tied to the pitcher-batter duel.

    Football is more like basketball and even more accurately American football. It is a series of individual battles wrapped within a team construct. It can be extremely difficult to isolate individual performance when so much of it is due to team contributions.

    With that said, the state of basketball and American football sabermetrics suggests that models can be constructed, especially as more data accumulates moving forward.

    michal schatzky said...

    Chris ,
    Great post and great blog in general.
    I would appreciate your (statistical) insight into a problem I've been wrestling with for a while. "Never Captain Nicky Butt" addressed it today - RVP.
    Do you agree with their statistical reasoning, which is basically that RVP has been involved in 40% of United's goals and if this continues he is close to a "must have"?
    Right now I hold him, as well as Berbatov and Tevez (who is inching towards the door), and my team is unbalanced.
    To cash in RVP for a sub 10mil striker, or even Aguero, would free up a lot of money for my midfield.
    I've also noticed that at this stage -gw8 - very few of the players at the top of the leaderboard hold RVP. The balance of their squads is fundamentally different than mine solely because so much of my cash is tied up in one player.
    Of course in another 8 gameweeks it could be the reverse ,with those holding RVP at the top of the board.
    Thanks,
    Mike

    @shots_on_target said...

    Hi Chris.

    First, great post and some really sound principles laid out.

    Sorry it took me so long to make a comment but I have been a bit laid out myself with a cold this week.

    It's hard to comment on such a detailed and thorough write up too.

    I agree that SoT alone isn't the way forward; a combination, to different degrees, of all (or as many as possible) of a player's underlying stats is the model we want to aim for.

    In Table 1 how did you determine positions? Was it by FPL categorisation?

    I think for all players we need to weight or anchor their seasonal shot accuracy and goal conversion % to a league avg. for their position type (more than just MID or FWD), a classification we can hopefully determine from their "2nd layer down" of underlying data.

    I really like the adjustment you make for opponents faced, would be great to test a few ideas out in practice.

    As for the actual forecast, I think that's spot on, and as good as it really needs to be from a fantasy perspective.

    All in all, brilliant.

    Gummi said...

    Another late reader, as I was swamped with work.

    Just brilliant. You are the reason I got thinking deeper about stats and the link to the Fantasy game and you keep on publishing articles that push the limit on the Fantasy "research".