```
<- read.csv("avg_daily_petrol_prices.csv")
df $date <- as.Date(df$date) # format dates df
```

# 21 Aggregating by Group

Sometimes we want to get the average or sum of a variable by group. We can do this in R using the `aggregate()`

function. Let’s learn this function using an example. We will use the daily average petrol price data from before. Suppose we wanted to get the average petrol price by year from this. Let’s load up the data again:

To aggregate by year, we first need to create a variable which gives the year corresponding to the date. There are several ways to do this.

One way to get the year is to re-format the date so that it only shows the year. Recall from Chapter 13, that `%Y`

represents the year in R:

`$year <- format(df$date, format = "%Y") df`

But an easier way is to use the `year()`

function from the `lubridate`

package. You can install it with `install.packages("lubridate")`

. With this package loaded, we can use the function with:

```
library(lubridate)
$year <- year(df$date) df
```

The next step is to use the `aggregate()`

function. If I want the average price of E10 petrol by year, I can do the following:

`aggregate(e10 ~ year, FUN = mean, data = df)`

```
year e10
1 2014 1.461632
2 2015 1.373425
3 2016 1.282375
4 2017 1.345430
5 2018 1.430955
6 2019 1.408177
7 2020 1.253603
8 2021 1.522833
9 2022 1.875897
```

Here we first provide the formula: We want the average of `e10`

by `year`

so we write `e10 ~ year`

. The function we want to use is the `mean`

(to get the average). Finally, we provide the name of the dataset, `df`

.

We can calculate the average of all petrol prices by year by adding all the variables with `cbind()`

:

`aggregate(cbind(e5, e10, diesel) ~ year, FUN = mean, data = df)`

```
year e5 e10 diesel
1 2014 1.519195 1.461632 1.334612
2 2015 1.393170 1.373425 1.173013
3 2016 1.302767 1.282375 1.081282
4 2017 1.368414 1.345430 1.161306
5 2018 1.454098 1.430955 1.287264
6 2019 1.430592 1.408177 1.265341
7 2020 1.288080 1.253603 1.111068
8 2021 1.579992 1.522833 1.387114
9 2022 1.933512 1.875897 1.941386
```

If you have a lot of variables, you can get the average of all of them by year by replacing the part before the `~`

with a dot:

`aggregate(. ~ year, FUN = mean, data = df)`

```
year date e5 e10 diesel
1 2014 16332.0 1.519195 1.461632 1.334612
2 2015 16618.0 1.393170 1.373425 1.173013
3 2016 16983.5 1.302767 1.282375 1.081282
4 2017 17349.0 1.368414 1.345430 1.161306
5 2018 17714.0 1.454098 1.430955 1.287264
6 2019 18079.0 1.430592 1.408177 1.265341
7 2020 18444.5 1.288080 1.253603 1.111068
8 2021 18810.0 1.579992 1.522833 1.387114
9 2022 19123.0 1.933512 1.875897 1.941386
```

You may notice that this also gives us the average of the date variable in the year as numbers that don’t look like dates. Similar to Excel, underlying each date in R is a number which is the number of days since the 1/1/1970 (in Excel it’s days since 1/1/1900). 1/1/2014 is 16,071 days since 1/1/1970, so that is why we see numbers this size there.

We can also replace `mean`

with any function we like. For example, to get the maximum average daily price in a year:

`aggregate(e10 ~ year, FUN = max, data = df)`

```
year e10
1 2014 1.542430
2 2015 1.499718
3 2016 1.375595
4 2017 1.393057
5 2018 1.552005
6 2019 1.533291
7 2020 1.438141
8 2021 1.703694
9 2022 2.203362
```

To get the number of observations per year:

`aggregate(e10 ~ year, FUN = length, data = df)`

```
year e10
1 2014 207
2 2015 365
3 2016 366
4 2017 365
5 2018 365
6 2019 365
7 2020 366
8 2021 365
9 2022 261
```