21  Aggregating by Group

Sometimes we want to get the average or sum of a variable by group. We can do this in R using the aggregate() function. Let’s learn this function using an example. We will use the daily average petrol price data from before. Suppose we wanted to get the average petrol price by year from this. Let’s load up the data again:

df <- read.csv("avg_daily_petrol_prices.csv")
df$date <- as.Date(df$date)  # format dates

To aggregate by year, we first need to create a variable which gives the year corresponding to the date. There are several ways to do this.

One way to get the year is to re-format the date so that it only shows the year. Recall from Chapter 13, that %Y represents the year in R:

df$year <- format(df$date, format = "%Y")

But an easier way is to use the year() function from the lubridate package. You can install it with install.packages("lubridate"). With this package loaded, we can use the function with:

library(lubridate)
df$year <- year(df$date)

The next step is to use the aggregate() function. If I want the average price of E10 petrol by year, I can do the following:

aggregate(e10 ~ year, FUN = mean, data = df)
  year      e10
1 2014 1.461632
2 2015 1.373425
3 2016 1.282375
4 2017 1.345430
5 2018 1.430955
6 2019 1.408177
7 2020 1.253603
8 2021 1.522833
9 2022 1.875897

Here we first provide the formula: We want the average of e10 by year so we write e10 ~ year. The function we want to use is the mean (to get the average). Finally, we provide the name of the dataset, df.

We can calculate the average of all petrol prices by year by adding all the variables with cbind():

aggregate(cbind(e5, e10, diesel) ~ year, FUN = mean, data = df)
  year       e5      e10   diesel
1 2014 1.519195 1.461632 1.334612
2 2015 1.393170 1.373425 1.173013
3 2016 1.302767 1.282375 1.081282
4 2017 1.368414 1.345430 1.161306
5 2018 1.454098 1.430955 1.287264
6 2019 1.430592 1.408177 1.265341
7 2020 1.288080 1.253603 1.111068
8 2021 1.579992 1.522833 1.387114
9 2022 1.933512 1.875897 1.941386

If you have a lot of variables, you can get the average of all of them by year by replacing the part before the ~ with a dot:

aggregate(. ~ year, FUN = mean, data = df)
  year    date       e5      e10   diesel
1 2014 16332.0 1.519195 1.461632 1.334612
2 2015 16618.0 1.393170 1.373425 1.173013
3 2016 16983.5 1.302767 1.282375 1.081282
4 2017 17349.0 1.368414 1.345430 1.161306
5 2018 17714.0 1.454098 1.430955 1.287264
6 2019 18079.0 1.430592 1.408177 1.265341
7 2020 18444.5 1.288080 1.253603 1.111068
8 2021 18810.0 1.579992 1.522833 1.387114
9 2022 19123.0 1.933512 1.875897 1.941386

You may notice that this also gives us the average of the date variable in the year as numbers that don’t look like dates. Similar to Excel, underlying each date in R is a number which is the number of days since the 1/1/1970 (in Excel it’s days since 1/1/1900). 1/1/2014 is 16,071 days since 1/1/1970, so that is why we see numbers this size there.

We can also replace mean with any function we like. For example, to get the maximum average daily price in a year:

aggregate(e10 ~ year, FUN = max, data = df)
  year      e10
1 2014 1.542430
2 2015 1.499718
3 2016 1.375595
4 2017 1.393057
5 2018 1.552005
6 2019 1.533291
7 2020 1.438141
8 2021 1.703694
9 2022 2.203362

To get the number of observations per year:

aggregate(e10 ~ year, FUN = length, data = df)
  year e10
1 2014 207
2 2015 365
3 2016 366
4 2017 365
5 2018 365
6 2019 365
7 2020 366
8 2021 365
9 2022 261