12 Dataframes: Summary Statistics
In this chapter we will learn some techniques for summarizing a dataframe using the same running example.
We load up the data and create some of the missing variables again (summarizing what we did in the last chapter):
df <- read.csv(text = "
team, wins, draws, losses, goals_for, goals_against
AZ, 20, 7, 7, 68, 35
Ajax, 20, 9, 5, 86, 38
Excelsior, 9, 5, 20, 32, 71
FC Emmen, 6, 10, 18, 33, 65
FC Groningen, 4, 6, 24, 31, 75
FC Twente, 18, 10, 6, 66, 27
FC Utrecht, 15, 9, 10, 55, 50
FC Volendam, 10, 6, 18, 42, 71
Feyenoord, 25, 7, 2, 81, 30
Fortuna Sittard, 10, 6, 18, 39, 62
Go Ahead Eagles, 10, 10, 14, 46, 56
NEC, 8, 15, 11, 42, 45
PSV, 23, 6, 5, 89, 40
RKC Waalwijk, 11, 8, 15, 50, 64
SC Cambuur, 5, 4, 25, 26, 69
Sparta Rotterdam, 17, 8, 9, 60, 37
Vitesse, 10, 10, 14, 45, 50
sc Heerenveen, 12, 10, 12, 44, 50
", strip.white = TRUE)
# Create goal difference:
df$goal_diff <- df$goals_for - df$goals_against
# Create total points scored over the season:
df$total_points <- 3 * df$wins + df$draws
# Order teams by season rank and create season ranking variable:
df <- df[order(df$total_points, df$goal_diff, decreasing = TRUE), ]
df$ranking <- 1:nrow(df)
12.1 summary() for Dataframes
To get a broad overview of a dataset, you can use the summary() function that we used for vectors in Chapter 5. When we use this function on a dataframe, it shows summary statistics for every variable in the dataframe:
summary(df)
     team                wins           draws            losses
 Length:18          Min.   : 4.00   Min.   : 4.000   Min.   : 2.00
 Class :character   1st Qu.: 9.25   1st Qu.: 6.000   1st Qu.: 7.50
 Mode  :character   Median :10.50   Median : 8.000   Median :13.00
                    Mean   :12.94   Mean   : 8.111   Mean   :12.94
                    3rd Qu.:17.75   3rd Qu.:10.000   3rd Qu.:18.00
                    Max.   :25.00   Max.   :15.000   Max.   :25.00
   goals_for     goals_against     goal_diff      total_points
 Min.   :26.00   Min.   :27.00   Min.   :-44.0   Min.   :18.00
 1st Qu.:39.75   1st Qu.:38.50   1st Qu.:-27.5   1st Qu.:36.00
 Median :45.50   Median :50.00   Median : -5.5   Median :40.50
 Mean   :51.94   Mean   :51.94   Mean   :  0.0   Mean   :46.94
 3rd Qu.:64.50   3rd Qu.:64.75   3rd Qu.: 30.5   3rd Qu.:62.75
 Max.   :89.00   Max.   :75.00   Max.   : 51.0   Max.   :82.00
    ranking
 Min.   : 1.00
 1st Qu.: 5.25
 Median : 9.50
 Mean   : 9.50
 3rd Qu.:13.75
 Max.   :18.00
For the team name, the output just says character, because we cannot compute the mean of a character variable. The only information we get is the number of observations (18). All the other variables are numeric, so the summary statistics (minimum, quartiles, median, mean, and maximum) are shown for each one.
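The difference between character and numeric columns can also be seen by calling summary() on single columns. The following is a minimal sketch, assuming a small hypothetical dataframe called small that holds only four of the teams (with values copied from the data above); the names small, wins_summary, mean_wins, and median_wins are made up for this illustration.

```r
# Hypothetical four-team subset; values copied from the running example
small <- data.frame(
  team  = c("Feyenoord", "PSV", "Ajax", "AZ"),
  wins  = c(25, 23, 20, 20),
  draws = c(7, 6, 9, 7),
  stringsAsFactors = FALSE  # keep team as character in older R versions too
)

# Character column: only length, class and mode are reported
summary(small$team)

# Numeric column: minimum, quartiles, median, mean and maximum
wins_summary <- summary(small$wins)
wins_summary

# Individual statistics can be pulled out of the result by name
mean_wins   <- as.numeric(wins_summary["Mean"])
median_wins <- as.numeric(wins_summary["Median"])
```

Storing the result in a variable like this is useful when we want to reuse one of the statistics later instead of only reading it off the console.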
12.2 head() and tail()
Another way to get a broad overview of a dataset is to just “eyeball” it by displaying it in the console with df, or browsing it in RStudio with View(df). For datasets with many observations, however, it may be easier to just look at the first few rows. We can do that with the head() function:
head(df)
               team wins draws losses goals_for goals_against goal_diff
9         Feyenoord   25     7      2        81            30        51
13              PSV   23     6      5        89            40        49
2              Ajax   20     9      5        86            38        48
1                AZ   20     7      7        68            35        33
6         FC Twente   18    10      6        66            27        39
16 Sparta Rotterdam   17     8      9        60            37        23
   total_points ranking
9            82       1
13           75       2
2            69       3
1            67       4
6            64       5
16           59       6
By default, head() shows the first 6 rows. We can look at a different number of rows by specifying the argument n. For example, to see the first 4 rows we would do:
head(df, n = 4)
        team wins draws losses goals_for goals_against goal_diff total_points
9  Feyenoord   25     7      2        81            30        51           82
13       PSV   23     6      5        89            40        49           75
2       Ajax   20     9      5        86            38        48           69
1         AZ   20     7      7        68            35        33           67
   ranking
9        1
13       2
2        3
1        4
The function tail() does the opposite: it shows the last n rows of the dataset, again with 6 rows by default. To see the two teams that are automatically relegated (the bottom 2) we would do:
tail(df, n = 2)
           team wins draws losses goals_for goals_against goal_diff
15   SC Cambuur    5     4     25        26            69       -43
5  FC Groningen    4     6     24        31            75       -44
   total_points ranking
15           19      17
5            18      18
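Both functions also accept a negative n, which drops rows instead of keeping them. The sketch below illustrates this on a small hypothetical dataframe called small (four teams, values copied from the data above, already in ranking order); all variable names here are made up for the example.

```r
# Hypothetical four-team subset, already sorted by ranking
small <- data.frame(
  team         = c("Feyenoord", "PSV", "Ajax", "AZ"),
  total_points = c(82, 75, 69, 67),
  stringsAsFactors = FALSE
)

# Positive n: keep the first / last n rows
top_two    <- head(small, n = 2)
bottom_two <- tail(small, n = 2)

# Negative n: keep everything except the last / first |n| rows
all_but_last  <- head(small, n = -1)
all_but_first <- tail(small, n = -1)
```

Note that the results are ordinary dataframes, so they can be stored and used like any other dataframe.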
12.3 nrow() and ncol()
Something we are often interested in is the total number of observations. We can find this by checking the number of rows in the dataframe with the nrow() function. In this case it is the number of teams. The number of columns (found with ncol()) shows the total number of variables.
nrow(df)
[1] 18
ncol(df)
[1] 9
If we want to quickly find both of these numbers, we can also use the dim() function, which shows the dimensions of the dataframe (first the number of rows, then the number of columns):
dim(df)
[1] 18 9
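Because dim() returns a vector with the number of rows first and the number of columns second, its elements match nrow() and ncol(). A minimal sketch, assuming a small hypothetical dataframe called small with four teams and three variables (values copied from the data above):

```r
# Hypothetical four-team subset with three variables
small <- data.frame(
  team  = c("Feyenoord", "PSV", "Ajax", "AZ"),
  wins  = c(25, 23, 20, 20),
  draws = c(7, 6, 9, 7),
  stringsAsFactors = FALSE
)

n_teams <- nrow(small)  # number of observations (rows)
n_vars  <- ncol(small)  # number of variables (columns)
dims    <- dim(small)   # both at once: rows first, then columns

# The two approaches agree
dims[1] == n_teams
dims[2] == n_vars
```

Storing the number of rows in a variable is handy when it is needed in later calculations, for example to compute an average by hand with sum(small$wins) / n_teams.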
12.4 names()
Sometimes we are just interested in what variables are included in the dataset. To see this, we can use the names() function:
names(df)
[1] "team"          "wins"          "draws"         "losses"
[5] "goals_for"     "goals_against" "goal_diff"     "total_points"
[9] "ranking"
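The vector returned by names() can also be used in other commands, for example to check whether a variable exists, to select columns, or to rename a variable. The following is only a sketch on a small hypothetical dataframe called small (values copied from the data above); the new column name victories is made up for the illustration.

```r
# Hypothetical four-team subset with three variables
small <- data.frame(
  team  = c("Feyenoord", "PSV", "Ajax", "AZ"),
  wins  = c(25, 23, 20, 20),
  draws = c(7, 6, 9, 7),
  stringsAsFactors = FALSE
)

# Check whether a variable is present in the dataframe
has_wins    <- "wins" %in% names(small)
has_ranking <- "ranking" %in% names(small)

# Keep every column except the team name
numeric_only <- small[, names(small) != "team"]

# Rename one variable by assigning into names()
names(small)[names(small) == "wins"] <- "victories"
names(small)
```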