12  Dataframes: Summary Statistics

In this chapter we will learn some techniques for summarizing a dataframe, using the same running example as in the previous chapter.

We first load the data and re-create the derived variables from the last chapter (goal difference, total points and ranking):

df <- read.csv(text = "
            team, wins, draws, losses, goals_for, goals_against
              AZ,   20,     7,      7,        68,            35
            Ajax,   20,     9,      5,        86,            38
       Excelsior,    9,     5,     20,        32,            71
        FC Emmen,    6,    10,     18,        33,            65
    FC Groningen,    4,     6,     24,        31,            75
       FC Twente,   18,    10,      6,        66,            27
      FC Utrecht,   15,     9,     10,        55,            50
     FC Volendam,   10,     6,     18,        42,            71
       Feyenoord,   25,     7,      2,        81,            30
 Fortuna Sittard,   10,     6,     18,        39,            62
 Go Ahead Eagles,   10,    10,     14,        46,            56
             NEC,    8,    15,     11,        42,            45
             PSV,   23,     6,      5,        89,            40
    RKC Waalwijk,   11,     8,     15,        50,            64
      SC Cambuur,    5,     4,     25,        26,            69
Sparta Rotterdam,   17,     8,      9,        60,            37
         Vitesse,   10,    10,     14,        45,            50
   sc Heerenveen,   12,    10,     12,        44,            50
", strip.white = TRUE)

# Create goal difference:
df$goal_diff <- df$goals_for - df$goals_against

# Create total points scored over the season:
df$total_points <- 3 * df$wins + df$draws

# Order teams by season rank and create season ranking variable:
df <- df[order(df$total_points, df$goal_diff, decreasing = TRUE), ]
df$ranking <- 1:nrow(df)
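
As a quick check on the two formulas, the arithmetic can be worked through for a single team. The sketch below rebuilds only Feyenoord's row (the one-row dataframe `feyenoord` is a name introduced here purely for illustration) and applies the same expressions as above:

```r
# One-row dataframe with Feyenoord's record, copied from the data above.
# The name `feyenoord` is used only for this illustration.
feyenoord <- data.frame(team = "Feyenoord", wins = 25, draws = 7, losses = 2,
                        goals_for = 81, goals_against = 30)

# Same formulas as for the full dataframe:
feyenoord$goal_diff <- feyenoord$goals_for - feyenoord$goals_against  # 81 - 30
feyenoord$total_points <- 3 * feyenoord$wins + feyenoord$draws        # 3 * 25 + 7

feyenoord$goal_diff
feyenoord$total_points
```

These evaluate to 51 and 82 respectively.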

12.1 summary() for Dataframes

To get a broad overview of a dataset, you can use the summary() function that we used for vectors in Chapter 5. When applied to a dataframe, it shows summary statistics for every variable in the dataframe:

summary(df)
     team                wins           draws            losses     
 Length:18          Min.   : 4.00   Min.   : 4.000   Min.   : 2.00  
 Class :character   1st Qu.: 9.25   1st Qu.: 6.000   1st Qu.: 7.50  
 Mode  :character   Median :10.50   Median : 8.000   Median :13.00  
                    Mean   :12.94   Mean   : 8.111   Mean   :12.94  
                    3rd Qu.:17.75   3rd Qu.:10.000   3rd Qu.:18.00  
                    Max.   :25.00   Max.   :15.000   Max.   :25.00  
   goals_for     goals_against     goal_diff      total_points  
 Min.   :26.00   Min.   :27.00   Min.   :-44.0   Min.   :18.00  
 1st Qu.:39.75   1st Qu.:38.50   1st Qu.:-27.5   1st Qu.:36.00  
 Median :45.50   Median :50.00   Median : -5.5   Median :40.50  
 Mean   :51.94   Mean   :51.94   Mean   :  0.0   Mean   :46.94  
 3rd Qu.:64.50   3rd Qu.:64.75   3rd Qu.: 30.5   3rd Qu.:62.75  
 Max.   :89.00   Max.   :75.00   Max.   : 51.0   Max.   :82.00  
    ranking     
 Min.   : 1.00  
 1st Qu.: 5.25  
 Median : 9.50  
 Mean   : 9.50  
 3rd Qu.:13.75  
 Max.   :18.00  

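For the team variable, summary() reports only the class and mode (character), because statistics such as the mean are not defined for text. The only other information we get is the length, which is the number of observations (18). All the other variables are numeric, so for each of them we get the minimum, first quartile, median, mean, third quartile and maximum.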
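
summary() also works on a single column, and its result can be stored and indexed by name like any named vector. A minimal sketch, assuming a small four-team dataframe called `mini` (a hypothetical subset of the data above, rebuilt here so the snippet runs on its own):

```r
# `mini` is an illustrative subset: the top four teams and their wins.
mini <- data.frame(
  team = c("Feyenoord", "PSV", "Ajax", "AZ"),
  wins = c(25, 23, 20, 20)
)

# Summary statistics for one numeric column only:
wins_summary <- summary(mini$wins)
wins_summary

# Pick out individual statistics by their printed name:
wins_summary[["Mean"]]
wins_summary[["Median"]]
```

The names to index with are the labels shown in the printed output ("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.").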

12.2 head() and tail()

Another way to get a broad overview of a dataset is to just “eyeball” it by displaying it in the console with df, or browsing it in RStudio with View(df). For datasets with many observations, however, it may be easier to just look at the first few rows. We can do that with the head() function:

head(df)
               team wins draws losses goals_for goals_against goal_diff
9         Feyenoord   25     7      2        81            30        51
13              PSV   23     6      5        89            40        49
2              Ajax   20     9      5        86            38        48
1                AZ   20     7      7        68            35        33
6         FC Twente   18    10      6        66            27        39
16 Sparta Rotterdam   17     8      9        60            37        23
   total_points ranking
9            82       1
13           75       2
2            69       3
1            67       4
6            64       5
16           59       6

By default, head() shows the first 6 rows. We can ask for a different number of rows with the argument n. For example, to see the first 4 rows we would use:

head(df, n = 4)
        team wins draws losses goals_for goals_against goal_diff total_points
9  Feyenoord   25     7      2        81            30        51           82
13       PSV   23     6      5        89            40        49           75
2       Ajax   20     9      5        86            38        48           69
1         AZ   20     7      7        68            35        33           67
   ranking
9        1
13       2
2        3
1        4
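
With many columns the printed output wraps onto extra lines, as it does above. One way to keep it compact is to select only the columns of interest before calling head(). A sketch, assuming an illustrative five-team dataframe `mini` (a hypothetical subset of the chapter's data, defined here so the snippet is self-contained):

```r
# `mini` is an illustrative subset of the ranked data.
mini <- data.frame(
  team = c("Feyenoord", "PSV", "Ajax", "AZ", "FC Twente"),
  wins = c(25, 23, 20, 20, 18),
  draws = c(7, 6, 9, 7, 10),
  total_points = c(82, 75, 69, 67, 64)
)

# First 3 rows, restricted to two columns:
top3 <- head(mini[, c("team", "total_points")], n = 3)
top3
```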

The function tail() is the counterpart of head(): it shows the last n rows of the dataset, again with 6 rows by default. To see the two teams that are automatically relegated (the bottom 2) we would use:

tail(df, n = 2)
           team wins draws losses goals_for goals_against goal_diff
15   SC Cambuur    5     4     25        26            69       -43
5  FC Groningen    4     6     24        31            75       -44
   total_points ranking
15           19      17
5            18      18
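
Both functions also accept a negative n, which means "all rows except n of them": head() with n = -k drops the last k rows, and tail() with n = -k drops the first k rows. A small sketch, using an illustrative dataframe `mini` (a hypothetical five-team subset, not the full dataset):

```r
# `mini` is an illustrative subset of the ranked data.
mini <- data.frame(
  team = c("Feyenoord", "PSV", "Ajax", "AZ", "FC Twente"),
  total_points = c(82, 75, 69, 67, 64)
)

# Everything except the last 2 rows:
all_but_last2 <- head(mini, n = -2)
all_but_last2

# Everything except the first 2 rows:
all_but_first2 <- tail(mini, n = -2)
all_but_first2
```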

12.3 nrow() and ncol()

Something we are often interested in is the total number of observations. We can find this by checking the number of rows in the dataframe with the nrow() function. In this case it is the number of teams. The number of columns (found with ncol()) shows the total number of variables.

nrow(df)
[1] 18
ncol(df)
[1] 9

If we want to quickly find both of these numbers, we can also use the dim() function, which shows the dimensions of the dataframe (first the number of rows, then the number of columns):

dim(df)
[1] 18  9
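
Because dim() returns an ordinary vector of length two, its elements can be extracted with square brackets. A minimal sketch, assuming a small illustrative dataframe `mini` (hypothetical, defined only for this example):

```r
# `mini` is an illustrative 3-row, 2-column dataframe.
mini <- data.frame(
  team = c("Feyenoord", "PSV", "Ajax"),
  wins = c(25, 23, 20)
)

size <- dim(mini)
size[1]  # number of rows, same value as nrow(mini)
size[2]  # number of columns, same value as ncol(mini)
```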

12.4 names()

Sometimes we are just interested in what variables are included in the dataset. To see this, we can use the names() function:

names(df)
[1] "team"          "wins"          "draws"         "losses"       
[5] "goals_for"     "goals_against" "goal_diff"     "total_points" 
[9] "ranking"
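
Since names() returns a character vector, it can be combined with %in% to check whether a particular variable is present. For dataframes, colnames() returns the same vector. A short sketch, assuming an illustrative two-column dataframe `mini` (hypothetical, defined only for this example):

```r
# `mini` is an illustrative dataframe with two variables.
mini <- data.frame(
  team = c("Feyenoord", "PSV"),
  wins = c(25, 23)
)

# Is a given variable in the dataframe?
has_wins <- "wins" %in% names(mini)
has_ranking <- "ranking" %in% names(mini)
has_wins
has_ranking

# For dataframes, names() and colnames() agree:
same_names <- identical(names(mini), colnames(mini))
same_names
```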