<- read.csv(text = "
df team, wins, draws, losses, goals_for, goals_against
AZ, 20, 7, 7, 68, 35
Ajax, 20, 9, 5, 86, 38
Excelsior, 9, 5, 20, 32, 71
FC Emmen, 6, 10, 18, 33, 65
FC Groningen, 4, 6, 24, 31, 75
FC Twente, 18, 10, 6, 66, 27
FC Utrecht, 15, 9, 10, 55, 50
FC Volendam, 10, 6, 18, 42, 71
Feyenoord, 25, 7, 2, 81, 30
Fortuna Sittard, 10, 6, 18, 39, 62
Go Ahead Eagles, 10, 10, 14, 46, 56
NEC, 8, 15, 11, 42, 45
PSV, 23, 6, 5, 89, 40
RKC Waalwijk, 11, 8, 15, 50, 64
SC Cambuur, 5, 4, 25, 26, 69
Sparta Rotterdam, 17, 8, 9, 60, 37
Vitesse, 10, 10, 14, 45, 50
sc Heerenveen, 12, 10, 12, 44, 50
", strip.white = TRUE)
11 Dataframes: Creating Variables
Here we will continue to work with the example dataframe from Chapter 10:
We can see that we are missing some of the columns we usually see when we look at the standings of an association football league. These are:
- The goal difference.
- The number of points the team has.
- The rankings of the teams in the league.
- The relegation status of each team.
What we will learn in this chapter is how to create these variables from the existing ones we already have.
11.1 Goal Difference
The goal difference in association football is “goals for” minus “goals against”. To get the goal difference then, we subtract df$goals_against
from df$goals_for
:
$goals_for - df$goals_against df
[1] 33 48 -39 -32 -44 39 5 -29 51 -23 -10 -3 49 -14 -43 23 -5 -6
This prints out the goal difference for all 18 teams, but it does not save it in the dataset. To do that, we need to assign this output to a new variable in df
. We do that using the dollar symbol again:
$goal_diff <- df$goals_for - df$goals_against df
Now when we look at our dataset, there is a new variable in it:
df
team wins draws losses goals_for goals_against goal_diff
1 AZ 20 7 7 68 35 33
2 Ajax 20 9 5 86 38 48
3 Excelsior 9 5 20 32 71 -39
4 FC Emmen 6 10 18 33 65 -32
5 FC Groningen 4 6 24 31 75 -44
6 FC Twente 18 10 6 66 27 39
7 FC Utrecht 15 9 10 55 50 5
8 FC Volendam 10 6 18 42 71 -29
9 Feyenoord 25 7 2 81 30 51
10 Fortuna Sittard 10 6 18 39 62 -23
11 Go Ahead Eagles 10 10 14 46 56 -10
12 NEC 8 15 11 42 45 -3
13 PSV 23 6 5 89 40 49
14 RKC Waalwijk 11 8 15 50 64 -14
15 SC Cambuur 5 4 25 26 69 -43
16 Sparta Rotterdam 17 8 9 60 37 23
17 Vitesse 10 10 14 45 50 -5
18 sc Heerenveen 12 10 12 44 50 -6
11.2 Total Points
In association football leagues, a team gets 3 points for a win and 1 point for a draw. They get 0 points for a loss. The formula for calculating the total number of points is then: \text{Points} = 3 \times \text{Wins} + 1 \times \text{Draws} + 0 \times \text{Losses} In our data we observe the number of wins, draws and losses, but not the total points. So we need to create this variable from the other ones using the formula above. We can do this with:
$total_points <- 3 * df$wins + df$draws df
We didn’t need to include df$losses
in the formula because when we multiply it by zero it won’t make a difference.
Let’s take a look at what we’ve created:
c("team", "wins", "draws", "losses", "total_points")] df[,
team wins draws losses total_points
1 AZ 20 7 7 67
2 Ajax 20 9 5 69
3 Excelsior 9 5 20 32
4 FC Emmen 6 10 18 28
5 FC Groningen 4 6 24 18
6 FC Twente 18 10 6 64
7 FC Utrecht 15 9 10 54
8 FC Volendam 10 6 18 36
9 Feyenoord 25 7 2 82
10 Fortuna Sittard 10 6 18 36
11 Go Ahead Eagles 10 10 14 40
12 NEC 8 15 11 39
13 PSV 23 6 5 75
14 RKC Waalwijk 11 8 15 41
15 SC Cambuur 5 4 25 19
16 Sparta Rotterdam 17 8 9 59
17 Vitesse 10 10 14 40
18 sc Heerenveen 12 10 12 46
11.3 Team Ranking
At the moment, the data are sorted alphabetically by team. But usually we see them sorted by their ranking in the league. What we will do now is sort the data by their ranking in the league, and also create a variable that shows the team’s rank in the league.
In association football leagues, teams are ranked by the number of points they accumulated throughout the season. If two teams have the same number of points, we rank teams by their goal difference.1
We can create a ranking by sorting the data by the number of points and goal difference. We can do that using the order()
function in R. The first argument is what we want to sort by (total points). For breaking ties we can include additional arguments. By default, order()
sorts ascending so to sort descending we need to use the option decreasing = TRUE
.
If we use the order function by itself, we get:
order(df$total_points, decreasing = TRUE)
[1] 9 13 2 1 6 16 7 18 14 11 17 12 8 10 3 4 15 5
The 9 at the beginning means that the team with the most points is the one in the 9th position. Let’s check which one that was:
9, ] df[
team wins draws losses goals_for goals_against goal_diff total_points
9 Feyenoord 25 7 2 81 30 51 82
This is correct, because Feyenoord won the competition. The 2nd number 13 means the team in the 13th position (PSV) came second.
To actually sort the data, we need to use this function when specifying the row indices:
<- df[order(df$total_points, decreasing = TRUE), ]
df c("team", "total_points", "goal_diff")] df[,
team total_points goal_diff
9 Feyenoord 82 51
13 PSV 75 49
2 Ajax 69 48
1 AZ 67 33
6 FC Twente 64 39
16 Sparta Rotterdam 59 23
7 FC Utrecht 54 5
18 sc Heerenveen 46 -6
14 RKC Waalwijk 41 -14
11 Go Ahead Eagles 40 -10
17 Vitesse 40 -5
12 NEC 39 -3
8 FC Volendam 36 -29
10 Fortuna Sittard 36 -23
3 Excelsior 32 -39
4 FC Emmen 28 -32
15 SC Cambuur 19 -43
5 FC Groningen 18 -44
If there are ties in total points, the function will keep the initial ordering. Here, both Go Ahead Eagles and Vitesse have 40 points, but Vitesse has a better goal difference (-5 instead of -10). But for the ranking to be correct, we need Vitesse to be ahead of Go Ahead Eagles. To do this, we add df$goal_diff
as another argument to the order()
function. This orders by goal difference whenever there is a tie in points:
<- df[order(df$total_points, df$goal_diff, decreasing = TRUE), ]
df c("team", "total_points", "goal_diff")] df[,
team total_points goal_diff
9 Feyenoord 82 51
13 PSV 75 49
2 Ajax 69 48
1 AZ 67 33
6 FC Twente 64 39
16 Sparta Rotterdam 59 23
7 FC Utrecht 54 5
18 sc Heerenveen 46 -6
14 RKC Waalwijk 41 -14
17 Vitesse 40 -5
11 Go Ahead Eagles 40 -10
12 NEC 39 -3
10 Fortuna Sittard 36 -23
8 FC Volendam 36 -29
3 Excelsior 32 -39
4 FC Emmen 28 -32
15 SC Cambuur 19 -43
5 FC Groningen 18 -44
Now we get the right ordering. We can also confirm that there are no ties in both total points and goal difference.
To create the ranking variable we can simply create a sequence from 1 to 18. We can do this with 1:18
. But another way to get 18 is to use nrow(df)
, which is the number of rows in df
:
$ranking <- 1:nrow(df)
dfc("team", "total_points", "goal_diff", "ranking")] df[,
team total_points goal_diff ranking
9 Feyenoord 82 51 1
13 PSV 75 49 2
2 Ajax 69 48 3
1 AZ 67 33 4
6 FC Twente 64 39 5
16 Sparta Rotterdam 59 23 6
7 FC Utrecht 54 5 7
18 sc Heerenveen 46 -6 8
14 RKC Waalwijk 41 -14 9
17 Vitesse 40 -5 10
11 Go Ahead Eagles 40 -10 11
12 NEC 39 -3 12
10 Fortuna Sittard 36 -23 13
8 FC Volendam 36 -29 14
3 Excelsior 32 -39 15
4 FC Emmen 28 -32 16
15 SC Cambuur 19 -43 17
5 FC Groningen 18 -44 18
You can confirm that this is the correct points, goal difference and rankings by checking the table here.
11.4 Relegation Status
The last variable we will create is the relegation status. In the Eredivisie, the teams ranked 17th and 18th are automatically relegated to the lower “Keuken Kampioen” (Kitchen Champion) league (where Tilburg’s Willem II competed that year). The 16th team enters into a playoff with teams in the Keuken Kampioen league. We will create a character variable with this information.
To do this we first create a variable which is blank everywhere: (""
). We then fill in values depending on the rank of the team using indexing:
$relegation_status <- ""
df$relegation_status[df$ranking < 16] <- "No relegation"
df$relegation_status[df$ranking == 16] <- "Relegation playoffs"
df$relegation_status[df$ranking == 17 |
df$ranking == 18] <- "Automatic relegation"
dfc("team", "ranking", "relegation_status")] df[,
team ranking relegation_status
9 Feyenoord 1 No relegation
13 PSV 2 No relegation
2 Ajax 3 No relegation
1 AZ 4 No relegation
6 FC Twente 5 No relegation
16 Sparta Rotterdam 6 No relegation
7 FC Utrecht 7 No relegation
18 sc Heerenveen 8 No relegation
14 RKC Waalwijk 9 No relegation
17 Vitesse 10 No relegation
11 Go Ahead Eagles 11 No relegation
12 NEC 12 No relegation
10 Fortuna Sittard 13 No relegation
8 FC Volendam 14 No relegation
3 Excelsior 15 No relegation
4 FC Emmen 16 Relegation playoffs
15 SC Cambuur 17 Automatic relegation
5 FC Groningen 18 Automatic relegation
For teams in rank 17 or 18, we use the logical OR operator: If the ranking is equal to 17 or equal to 18, we set the status to “Automatic relegation”.
Writing df$ranking == 17 | df$ranking == 18
inside the brackets is quite long. If you had more numbers you wanted to compare the ranking to, it would become a really long command. Fortunately R has a special operator we can use as a shortcut: the %in%
operator. We can use the %in%
operator to do the same thing as follows:
$relegation_status[df$ranking %in% 17:18] <- "Automatic relegation" df
What is the %in%
doing? When we write a %in% b
we are checking for each element in a
if there is a matching element somewhere in b
. To see this at work, consider the following example:
<- 1:6
a <- c(3, 5, 7)
b %in% b a
[1] FALSE FALSE TRUE FALSE TRUE FALSE
Here the 3rd and 5th element are TRUE
, because the 3rd and 5th element of a
(which are 3 and 5) are somewhere in b
(3 and 5 are in b
, but 1, 2, 4 and 6 aren’t). An equivalent way of doing it (but with more typing) would be to see if a=3 or a=5 or a=7 for each element (i.e. check for a match in any of the elements of b
):
== b[1] | a == b[2] | a == b[3] a
[1] FALSE FALSE TRUE FALSE TRUE FALSE
In the example above, we had df$ranking %in% 17:18
. This returns TRUE
if the team’s ranking was one of 17 or 18 and is FALSE
otherwise.
If they are also tied on their goal difference, we rank them on their goals scored, then head-to-head points, then head-to-head goal difference, then head-to-head away goals scored. If they are still tied they will play another match (possibly ending with penalties) if it is for 1st place, relegation, or a European league; otherwise they will draw a ranking randomly. Fortunately, as we will see we can determine a clear ranking based on only points and goal scored, so we won’t need to code all these possibilities here. But if we wanted to be very exact we should also included these extra possibilities in our code.↩︎