df <- read.csv(text = "
team, wins, draws, losses, goals_for, goals_against
AZ, 20, 7, 7, 68, 35
Ajax, 20, 9, 5, 86, 38
Excelsior, 9, 5, 20, 32, 71
FC Emmen, 6, 10, 18, 33, 65
FC Groningen, 4, 6, 24, 31, 75
FC Twente, 18, 10, 6, 66, 27
FC Utrecht, 15, 9, 10, 55, 50
FC Volendam, 10, 6, 18, 42, 71
Feyenoord, 25, 7, 2, 81, 30
Fortuna Sittard, 10, 6, 18, 39, 62
Go Ahead Eagles, 10, 10, 14, 46, 56
NEC, 8, 15, 11, 42, 45
PSV, 23, 6, 5, 89, 40
RKC Waalwijk, 11, 8, 15, 50, 64
SC Cambuur, 5, 4, 25, 26, 69
Sparta Rotterdam, 17, 8, 9, 60, 37
Vitesse, 10, 10, 14, 45, 50
sc Heerenveen, 12, 10, 12, 44, 50
", strip.white = TRUE)
10 Dataframes: Indexing
In Chapter 4 and Chapter 8 we encountered dataframes. In the following three chapters we will learn more about how to work with them. In this chapter we will look at indexing dataframes.
10.1 Running Example: The Eredivisie Results from 2022/23
The dataset we will work with in the next three chapters contains the team, number of wins, draws, losses, goals for, and goals against for all 18 teams over the 34 matches each of them played in the 2022/23 season of the Eredivisie. The Eredivisie is the top Dutch association football league. Here, “goals for” means the total number of goals the team scored that season, whereas “goals against” is the total number of goals the team conceded that season.
You can copy the read.csv() code chunk directly into R and it will read the data in as a data.frame. This is another way of reading datasets into R: instead of giving a filename, you can pass the contents of a CSV file directly to the read.csv() function we have seen before. To do this we need to use the text option. I am doing it this way to save you the time of copying the data and saving a new file on your computer, but it’s also good to see other ways of reading in data.
Note: The strip.white option I have used in this command removes any leading and trailing white space around the values, such as the team names.
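To see what strip.white does, here is a minimal sketch on a made-up two-row CSV (the object names csv_text, raw and clean are only for illustration and are not part of the chapter's dataset). The team column is deliberately placed second so that every name follows a comma and a space.

```r
# Hypothetical mini CSV: each team name is preceded by a space after the comma.
csv_text <- "wins, team
20, Ajax
23, PSV"

raw   <- read.csv(text = csv_text)                     # strip.white defaults to FALSE
clean <- read.csv(text = csv_text, strip.white = TRUE) # white space around fields removed

# Compare the second column (the team names) in both versions.
as.character(raw[[2]])    # names still carry their leading space
as.character(clean[[2]])  # names without surrounding white space
```

Without strip.white = TRUE, a comparison such as team == "Ajax" would fail, because the stored value would start with a space.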
10.2 Indexing with Dataframes
In Chapter 5 we learned that we can get the 3rd element of a vector a with a[3]. We can also extract elements of a dataframe in a similar way. To get the element in the 2nd row and 3rd column of a dataframe, we do:
df[2, 3]
[1] 9
Inside the square bracket we first specify the rows, then after a comma we specify the columns.
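As a small self-contained sketch of the rows-then-columns rule, the code below builds a four-team subset of the data by hand (df_small is a hypothetical name, used so that the example runs without the full dataset) and extracts single cells from it.

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team  = c("AZ", "Ajax", "Feyenoord", "PSV"),
  wins  = c(20, 20, 25, 23),
  draws = c(7, 9, 7, 6)
)

# Row index first, column index second.
df_small[2, 3]  # row 2 (Ajax), column 3 (draws)
df_small[3, 1]  # row 3, column 1 (team)
```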
We can also put multiple indexes in each part. Suppose we want a smaller dataframe of only the rows with Ajax, Feyenoord and PSV and only the columns with the team name and number of wins. We first check which rows those teams occupy (2, 9 and 13) and which columns those variables are in (1 and 2). We then do:
df[c(2, 9, 13), c(1, 2)]
        team wins
2       Ajax   20
9  Feyenoord   25
13       PSV   23
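Looking up the row numbers by eye works for a small table, but they can also be computed. The sketch below is one possible approach rather than the method used in this chapter: it combines the base R functions which() and %in% on a hand-built subset (df_small and rows are illustrative names).

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team = c("AZ", "Ajax", "Feyenoord", "PSV"),
  wins = c(20, 20, 25, 23)
)

# which() turns the TRUE/FALSE result of %in% into row numbers.
rows <- which(df_small$team %in% c("Ajax", "PSV"))
rows

# Use the computed row numbers together with column indices 1 and 2.
df_small[rows, c(1, 2)]
```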
If we leave the columns part blank, it will give us the entire row. For example, to get all the results for just Ajax we just get the 2nd row:
df[2, ]
  team wins draws losses goals_for goals_against
2 Ajax   20     9      5        86            38
Similarly, if we leave the row part blank and only give column indices, it will give us all rows for those columns. If we just want the column of wins we can do:
df[, 2]
[1] 20 20 9 6 4 18 15 10 25 10 10 8 23 11 5 17 10 12
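Note that selecting a single column this way returns a plain vector rather than a dataframe. As a brief sketch on a hand-built subset (df_small, wins_vec and wins_df are illustrative names), base R's drop = FALSE argument keeps the result as a one-column dataframe if that is what you need.

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team = c("AZ", "Ajax", "Feyenoord", "PSV"),
  wins = c(20, 20, 25, 23)
)

wins_vec <- df_small[, 2]                # single column: simplified to a vector
wins_df  <- df_small[, 2, drop = FALSE]  # single column kept as a dataframe

is.data.frame(wins_vec)
is.data.frame(wins_df)
```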
We can also get a column of a dataframe using the name of the variable. For example, if we want to get the goals_for column, we can do:
df$goals_for
[1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44
The dollar symbol here is what is called an extraction operator. The dollar symbol is required because goals_for is part of df. The variable goals_for is not a standalone vector in our environment. The df$ tells R to look for goals_for inside df.
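The following sketch illustrates this point on a hand-built subset (df_small is an illustrative name). It assumes a fresh R session in which no object called goals_for has been created.

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team      = c("AZ", "Ajax", "Feyenoord", "PSV"),
  goals_for = c(68, 86, 81, 89)
)

# goals_for is a column name inside df_small ...
"goals_for" %in% names(df_small)

# ... but, assuming a fresh session, there is no standalone object with that name.
exists("goals_for")

# The extraction operator tells R where to look.
df_small$goals_for
```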
We can also use the name of the variable in the part where we specify the column indices:
df[, "goals_for"]
[1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44
We can also place multiple variable names in there:
df[, c("team", "goals_for")]
               team goals_for
1                AZ        68
2              Ajax        86
3         Excelsior        32
4          FC Emmen        33
5      FC Groningen        31
6         FC Twente        66
7        FC Utrecht        55
8       FC Volendam        42
9         Feyenoord        81
10  Fortuna Sittard        39
11  Go Ahead Eagles        46
12              NEC        42
13              PSV        89
14     RKC Waalwijk        50
15       SC Cambuur        26
16 Sparta Rotterdam        60
17          Vitesse        45
18    sc Heerenveen        44
Finally, another way to get a single variable from a dataframe is to place the name of the variable in quotes inside double square brackets:
df[["goals_for"]]
[1] 68 86 32 33 31 66 55 42 81 39 46 42 89 50 26 60 45 44
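The three ways of extracting a single column seen so far give the same vector. The sketch below checks this on a hand-built subset (df_small, by_dollar, by_name, by_double and col_name are illustrative names). It also shows one practical difference that goes slightly beyond the text above: the double square brackets accept a column name stored in another variable.

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team      = c("AZ", "Ajax", "Feyenoord", "PSV"),
  goals_for = c(68, 86, 81, 89)
)

by_dollar <- df_small$goals_for
by_name   <- df_small[, "goals_for"]
by_double <- df_small[["goals_for"]]

# All three results are the same numeric vector.
identical(by_dollar, by_name)
identical(by_dollar, by_double)

# The column name can also come from a variable.
col_name <- "goals_for"
df_small[[col_name]]
```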
We can also subset rows of a dataframe using logical operators, just like we saw in Chapter 5. For example, suppose we wanted to only see the results for teams that won at least 20 matches. The following will return a logical vector which is TRUE if the team won at least 20 matches, and FALSE if they won 19 or fewer matches:
df$wins >= 20
[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[13] TRUE FALSE FALSE FALSE FALSE FALSE
The first two are TRUE, because AZ and Ajax won at least 20 matches (they both won exactly 20). The next two are FALSE because Excelsior and FC Emmen won fewer than 20 matches (they won 9 and 6, respectively).
If we use this inside the square brackets where we specify the row indices, we get the desired result:
df[df$wins >= 20, ]
        team wins draws losses goals_for goals_against
1         AZ   20     7      7        68            35
2       Ajax   20     9      5        86            38
9  Feyenoord   25     7      2        81            30
13       PSV   23     6      5        89            40
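The same pattern can be broken into steps to make it easier to follow. This is a minimal sketch on a hand-built subset (df_small and keep are illustrative names, and the threshold of 23 wins is chosen only so that the small example filters something out).

```r
# Hand-built subset of the Eredivisie data (illustrative object name).
df_small <- data.frame(
  team = c("AZ", "Ajax", "Feyenoord", "PSV"),
  wins = c(20, 20, 25, 23)
)

# Step 1: build the logical vector, one TRUE/FALSE per row.
keep <- df_small$wins >= 23
keep

# Step 2: see which row numbers are TRUE.
which(keep)

# Step 3: use the logical vector in the row position, optionally with column names.
df_small[keep, c("team", "wins")]
```

Storing the logical vector in its own variable is optional, but it makes it easy to inspect which rows will be kept before subsetting.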