10  Dataframes: Indexing

In Chapter 4 and Chapter 8 we encountered dataframes. In the following three chapters we will learn more about how to work with them. This chapter we will be look at indexing with dataframes.

10.1 Running Example: The Eredivisie Results from 2022/23

The dataset we will work with in the next three chapters contains the team, number of wins, draws, losses, goals for, and goals against for all 18 teams and 38 matches from the 2022/23 season of the Eredivisie. The Eredivisie is the top Dutch association football league. Here, “goals for” means the total number of goals the team scored that season, whereas “goals against” is the total number of goals the team conceded that season.

You can copy the code chunk below directly into R and it will read the data in as a data.frame. This is another way of reading in datasets into R. You can provide the contents of a CSV file directly into the read.csv() function we have seen before instead of giving the filename. To do this we need to use the text option. I am doing it this way to save you time having to copy the data and save a new file on your computer, but it’s also good to see other ways of reading in data.

df <- read.csv(text = "
            team, wins, draws, losses, goals_for, goals_against
              AZ,   20,     7,      7,        68,            35
            Ajax,   20,     9,      5,        86,            38
       Excelsior,    9,     5,     20,        32,            71
        FC Emmen,    6,    10,     18,        33,            65
    FC Groningen,    4,     6,     24,        31,            75
       FC Twente,   18,    10,      6,        66,            27
      FC Utrecht,   15,     9,     10,        55,            50
     FC Volendam,   10,     6,     18,        42,            71
       Feyenoord,   25,     7,      2,        81,            30
 Fortuna Sittard,   10,     6,     18,        39,            62
 Go Ahead Eagles,   10,    10,     14,        46,            56
             NEC,    8,    15,     11,        42,            45
             PSV,   23,     6,      5,        89,            40
    RKC Waalwijk,   11,     8,     15,        50,            64
      SC Cambuur,    5,     4,     25,        26,            69
Sparta Rotterdam,   17,     8,      9,        60,            37
         Vitesse,   10,    10,     14,        45,            50
   sc Heerenveen,   12,    10,     12,        44,            50
", strip.white = TRUE)

Note: The strip.white option I have used in this command is to remove the empty spaces before and after the team names.

10.2 Indexing with Dataframes

In Chapter 5 we learned that we can get the 3rd element of a vector a with a[3]. We can also extract elements of a ataframe in a similar way. To get the 2nd row and 3rd column of a dataframe, we do:

df[2, 3]
[1] 9

Inside the square bracket we first specify the rows, then after a comma we specify the columns.

We can also put multiple indexes in each part. Suppose we want a smaller dataframe of only the rows with Ajax, Feyenoord and PSV and only the columns with the team name and number of wins. We first check which rows those teams occupy (2, 9 and 13) and which columns those variables are in (1 and 2). We then do:

df[c(2, 9, 13), c(1, 2)]
        team wins
2       Ajax   20
9  Feyenoord   25
13       PSV   23

If we leave the columns part blank, it will give us the entire row. For example, to get all the results for just Ajax we just get the 2nd row:

df[2, ]
  team wins draws losses goals_for goals_against
2 Ajax   20     9      5        86            38

Similarly, if we leave the row part blank and only give column indices, it will give us all rows for those columns. If we just want the column of wins we can do:

df[, 2]
 [1] 20 20  9  6  4 18 15 10 25 10 10  8 23 11  5 17 10 12

We can also get a column of a dataframe using the name of the variable. For example, if we want to get the goals_for column, we can do:

df$goals_for
 [1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44

The dollar symbol here is what is called an extraction operator. The dollar symbol is required because goals_for is part of df. The variable goals_for is not a standalone vector in our environment. The df$ tells R to look for goals_for inside df.

We can also use the name of the variable in the part where we specify the column indices:

df[, "goals_for"]
 [1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44

We can also place multiple variable names in there:

df[, c("team", "goals_for")]
               team goals_for
1                AZ        68
2              Ajax        86
3         Excelsior        32
4          FC Emmen        33
5      FC Groningen        31
6         FC Twente        66
7        FC Utrecht        55
8       FC Volendam        42
9         Feyenoord        81
10  Fortuna Sittard        39
11  Go Ahead Eagles        46
12              NEC        42
13              PSV        89
14     RKC Waalwijk        50
15       SC Cambuur        26
16 Sparta Rotterdam        60
17          Vitesse        45
18    sc Heerenveen        44

Finally, another way to get a single variable from a dataframe is to place the name of the variable in quotes inside double square brackets:

df[["goals_for"]]
 [1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44

We can also subset rows of a dataframe using logical operators, just like we saw in Chapter 5. For example, suppose we wanted to only see the results for teams that won at least 20 matches. The following will return a logical vector which is TRUE if the team won at least 20 matches, and FALSE if they won 19 or fewer matches:

df$wins >= 20
 [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[13]  TRUE FALSE FALSE FALSE FALSE FALSE

The first two are TRUE, because AZ and Ajax won at least 20 matches (they both won exactly 20). The next two are FALSE because Excelsior and FC Emmen won less than 20 matches (they won 9 and 6, respectively).

If we use this inside the square brackets where we specify the row indices, we get the desired result:

df[df$wins >= 20, ]
        team wins draws losses goals_for goals_against
1         AZ   20     7      7        68            35
2       Ajax   20     9      5        86            38
9  Feyenoord   25     7      2        81            30
13       PSV   23     6      5        89            40