11  Dataframes: Creating Variables

Here we will continue to work with the example dataframe from Chapter 10:

df <- read.csv(text = "
            team, wins, draws, losses, goals_for, goals_against
              AZ,   20,     7,      7,        68,            35
            Ajax,   20,     9,      5,        86,            38
       Excelsior,    9,     5,     20,        32,            71
        FC Emmen,    6,    10,     18,        33,            65
    FC Groningen,    4,     6,     24,        31,            75
       FC Twente,   18,    10,      6,        66,            27
      FC Utrecht,   15,     9,     10,        55,            50
     FC Volendam,   10,     6,     18,        42,            71
       Feyenoord,   25,     7,      2,        81,            30
 Fortuna Sittard,   10,     6,     18,        39,            62
 Go Ahead Eagles,   10,    10,     14,        46,            56
             NEC,    8,    15,     11,        42,            45
             PSV,   23,     6,      5,        89,            40
    RKC Waalwijk,   11,     8,     15,        50,            64
      SC Cambuur,    5,     4,     25,        26,            69
Sparta Rotterdam,   17,     8,      9,        60,            37
         Vitesse,   10,    10,     14,        45,            50
   sc Heerenveen,   12,    10,     12,        44,            50
", strip.white = TRUE)

We can see that we are missing some of the columns we usually see when we look at the standings of an association football league. These are:

  1. The goal difference.
  2. The number of points the team has.
  3. The rankings of the teams in the league.
  4. The relegation status of each team.

What we will learn in this chapter is how to create these variables from the existing ones we already have.

11.1 Goal Difference

The goal difference in association football is “goals for” minus “goals against”. To get the goal difference then, we subtract df$goals_against from df$goals_for:

df$goals_for - df$goals_against
 [1]  33  48 -39 -32 -44  39   5 -29  51 -23 -10  -3  49 -14 -43  23  -5  -6

This prints out the goal difference for all 18 teams, but it does not save it in the dataset. To do that, we need to assign this output to a new variable in df. We do that using the dollar symbol again:

df$goal_diff <- df$goals_for - df$goals_against

Now when we look at our dataset, there is a new variable in it:

df
               team wins draws losses goals_for goals_against goal_diff
1                AZ   20     7      7        68            35        33
2              Ajax   20     9      5        86            38        48
3         Excelsior    9     5     20        32            71       -39
4          FC Emmen    6    10     18        33            65       -32
5      FC Groningen    4     6     24        31            75       -44
6         FC Twente   18    10      6        66            27        39
7        FC Utrecht   15     9     10        55            50         5
8       FC Volendam   10     6     18        42            71       -29
9         Feyenoord   25     7      2        81            30        51
10  Fortuna Sittard   10     6     18        39            62       -23
11  Go Ahead Eagles   10    10     14        46            56       -10
12              NEC    8    15     11        42            45        -3
13              PSV   23     6      5        89            40        49
14     RKC Waalwijk   11     8     15        50            64       -14
15       SC Cambuur    5     4     25        26            69       -43
16 Sparta Rotterdam   17     8      9        60            37        23
17          Vitesse   10    10     14        45            50        -5
18    sc Heerenveen   12    10     12        44            50        -6

11.2 Total Points

In association football leagues, a team gets 3 points for a win and 1 point for a draw. They get 0 points for a loss. The formula for calculating the total number of points is then: \text{Points} = 3 \times \text{Wins} + 1 \times \text{Draws} + 0 \times \text{Losses} In our data we observe the number of wins, draws and losses, but not the total points. So we need to create this variable from the other ones using the formula above. We can do this with:

df$total_points <- 3 * df$wins + df$draws

We didn’t need to include df$losses in the formula because when we multiply it by zero it won’t make a difference.

Let’s take a look at what we’ve created:

df[, c("team", "wins", "draws", "losses", "total_points")]
               team wins draws losses total_points
1                AZ   20     7      7           67
2              Ajax   20     9      5           69
3         Excelsior    9     5     20           32
4          FC Emmen    6    10     18           28
5      FC Groningen    4     6     24           18
6         FC Twente   18    10      6           64
7        FC Utrecht   15     9     10           54
8       FC Volendam   10     6     18           36
9         Feyenoord   25     7      2           82
10  Fortuna Sittard   10     6     18           36
11  Go Ahead Eagles   10    10     14           40
12              NEC    8    15     11           39
13              PSV   23     6      5           75
14     RKC Waalwijk   11     8     15           41
15       SC Cambuur    5     4     25           19
16 Sparta Rotterdam   17     8      9           59
17          Vitesse   10    10     14           40
18    sc Heerenveen   12    10     12           46

11.3 Team Ranking

At the moment, the data are sorted alphabetically by team. But usually we see them sorted by their ranking in the league. What we will do now is sort the data by their ranking in the league, and also create a variable that shows the team’s rank in the league.

In association football leagues, teams are ranked by the number of points they accumulated throughout the season. If two teams have the same number of points, we rank teams by their goal difference.1

We can create a ranking by sorting the data by the number of points and goal difference. We can do that using the order() function in R. The first argument is what we want to sort by (total points). For breaking ties we can include additional arguments. By default, order() sorts ascending so to sort descending we need to use the option decreasing = TRUE.

If we use the order function by itself, we get:

order(df$total_points, decreasing = TRUE)
 [1]  9 13  2  1  6 16  7 18 14 11 17 12  8 10  3  4 15  5

The 9 at the beginning means that the team with the most points is the one in the 9th position. Let’s check which one that was:

df[9, ]
       team wins draws losses goals_for goals_against goal_diff total_points
9 Feyenoord   25     7      2        81            30        51           82

This is correct, because Feyenoord won the competition. The 2nd number 13 means the team in the 13th position (PSV) came second.

To actually sort the data, we need to use this function when specifying the row indices:

df <- df[order(df$total_points, decreasing = TRUE), ]
df[, c("team", "total_points", "goal_diff")]
               team total_points goal_diff
9         Feyenoord           82        51
13              PSV           75        49
2              Ajax           69        48
1                AZ           67        33
6         FC Twente           64        39
16 Sparta Rotterdam           59        23
7        FC Utrecht           54         5
18    sc Heerenveen           46        -6
14     RKC Waalwijk           41       -14
11  Go Ahead Eagles           40       -10
17          Vitesse           40        -5
12              NEC           39        -3
8       FC Volendam           36       -29
10  Fortuna Sittard           36       -23
3         Excelsior           32       -39
4          FC Emmen           28       -32
15       SC Cambuur           19       -43
5      FC Groningen           18       -44

If there are ties in total points, the function will keep the initial ordering. Here, both Go Ahead Eagles and Vitesse have 40 points, but Vitesse has a better goal difference (-5 instead of -10). But for the ranking to be correct, we need Vitesse to be ahead of Go Ahead Eagles. To do this, we add df$goal_diff as another argument to the order() function. This orders by goal difference whenever there is a tie in points:

df <- df[order(df$total_points, df$goal_diff, decreasing = TRUE), ]
df[, c("team", "total_points", "goal_diff")]
               team total_points goal_diff
9         Feyenoord           82        51
13              PSV           75        49
2              Ajax           69        48
1                AZ           67        33
6         FC Twente           64        39
16 Sparta Rotterdam           59        23
7        FC Utrecht           54         5
18    sc Heerenveen           46        -6
14     RKC Waalwijk           41       -14
17          Vitesse           40        -5
11  Go Ahead Eagles           40       -10
12              NEC           39        -3
10  Fortuna Sittard           36       -23
8       FC Volendam           36       -29
3         Excelsior           32       -39
4          FC Emmen           28       -32
15       SC Cambuur           19       -43
5      FC Groningen           18       -44

Now we get the right ordering. We can also confirm that there are no ties in both total points and goal difference.

To create the ranking variable we can simply create a sequence from 1 to 18. We can do this with 1:18. But another way to get 18 is to use nrow(df), which is the number of rows in df:

df$ranking <- 1:nrow(df)
df[, c("team", "total_points", "goal_diff", "ranking")]
               team total_points goal_diff ranking
9         Feyenoord           82        51       1
13              PSV           75        49       2
2              Ajax           69        48       3
1                AZ           67        33       4
6         FC Twente           64        39       5
16 Sparta Rotterdam           59        23       6
7        FC Utrecht           54         5       7
18    sc Heerenveen           46        -6       8
14     RKC Waalwijk           41       -14       9
17          Vitesse           40        -5      10
11  Go Ahead Eagles           40       -10      11
12              NEC           39        -3      12
10  Fortuna Sittard           36       -23      13
8       FC Volendam           36       -29      14
3         Excelsior           32       -39      15
4          FC Emmen           28       -32      16
15       SC Cambuur           19       -43      17
5      FC Groningen           18       -44      18

You can confirm that this is the correct points, goal difference and rankings by checking the table here.

11.4 Relegation Status

The last variable we will create is the relegation status. In the Eredivisie, the teams ranked 17th and 18th are automatically relegated to the lower “Keuken Kampioen” (Kitchen Champion) league (where Tilburg’s Willem II competed that year). The 16th team enters into a playoff with teams in the Keuken Kampioen league. We will create a character variable with this information.

To do this we first create a variable which is blank everywhere: (""). We then fill in values depending on the rank of the team using indexing:

df$relegation_status <- ""
df$relegation_status[df$ranking < 16] <- "No relegation"
df$relegation_status[df$ranking == 16] <- "Relegation playoffs"
df$relegation_status[df$ranking == 17 |
                     df$ranking == 18] <- "Automatic relegation"
df[, c("team", "ranking", "relegation_status")]
               team ranking    relegation_status
9         Feyenoord       1        No relegation
13              PSV       2        No relegation
2              Ajax       3        No relegation
1                AZ       4        No relegation
6         FC Twente       5        No relegation
16 Sparta Rotterdam       6        No relegation
7        FC Utrecht       7        No relegation
18    sc Heerenveen       8        No relegation
14     RKC Waalwijk       9        No relegation
17          Vitesse      10        No relegation
11  Go Ahead Eagles      11        No relegation
12              NEC      12        No relegation
10  Fortuna Sittard      13        No relegation
8       FC Volendam      14        No relegation
3         Excelsior      15        No relegation
4          FC Emmen      16  Relegation playoffs
15       SC Cambuur      17 Automatic relegation
5      FC Groningen      18 Automatic relegation

For teams in rank 17 or 18, we use the logical OR operator: If the ranking is equal to 17 or equal to 18, we set the status to “Automatic relegation”.

Writing df$ranking == 17 | df$ranking == 18 inside the brackets is quite long. If you had more numbers you wanted to compare the ranking to, it would become a really long command. Fortunately R has a special operator we can use as a shortcut: the %in% operator. We can use the %in% operator to do the same thing as follows:

df$relegation_status[df$ranking %in% 17:18] <- "Automatic relegation"

What is the %in% doing? When we write a %in% b we are checking for each element in a if there is a matching element somewhere in b. To see this at work, consider the following example:

a <- 1:6
b <- c(3, 5, 7)
a %in% b
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE

Here the 3rd and 5th element are TRUE, because the 3rd and 5th element of a (which are 3 and 5) are somewhere in b (3 and 5 are in b, but 1, 2, 4 and 6 aren’t). An equivalent way of doing it (but with more typing) would be to see if a=3 or a=5 or a=7 for each element (i.e. check for a match in any of the elements of b):

a == b[1] | a == b[2] | a == b[3]
[1] FALSE FALSE  TRUE FALSE  TRUE FALSE

In the example above, we had df$ranking %in% 17:18. This returns TRUE if the team’s ranking was one of 17 or 18 and is FALSE otherwise.


  1. If they are also tied on their goal difference, we rank them on their goals scored, then head-to-head points, then head-to-head goal difference, then head-to-head away goals scored. If they are still tied they will play another match (possibly ending with penalties) if it is for 1st place, relegation, or a European league; otherwise they will draw a ranking randomly. Fortunately, as we will see we can determine a clear ranking based on only points and goal scored, so we won’t need to code all these possibilities here. But if we wanted to be very exact we should also included these extra possibilities in our code.↩︎