20  Reshaping

20.1 From Long to Wide

Suppose we have the following dataset:

long <- data.frame(
  id       = rep(1:3, each = 2),
  variable = rep(c("x", "y"), times = 3),
  value    = c(3, 5, 4, 8, 3, 1)
)
long
  id variable value
1  1        x     3
2  1        y     5
3  2        x     4
4  2        y     8
5  3        x     3
6  3        y     1

This dataset is in what is called “long” format. We have 3 individuals, with IDs 1, 2, 3. For each individual we have 2 variables, x and y, and for each individual and variable we observe the value in the value column.

If we want to reshape this dataset so that it has only 1 row per individual (3 rows in total), with the variables x and y as separate variables, we can use functions from the reshape2 package. Install the package with install.packages("reshape2"). You can use the dcast() function from this package to reshape the data as follows:

library(reshape2)
wide <- dcast(long, id ~ variable)
wide
  id x y
1  1 3 5
2  2 4 8
3  3 3 1

The first argument is the name of the dataset. The second argument is the formula for how to reshape. We put the ID variable that we want to represent the rows first, then we use the ~ symbol, and then we put the variable with the different variable names.

20.2 From Wide to Long

We can also go the other direction. Let’s get back to our original data by reshaping the new wide dataset back to long. Let’s call the output long2. We can do that with the melt() function:

long2 <- melt(wide, id.vars = "id")
long2
  id variable value
1  1        x     3
2  2        x     4
3  3        x     3
4  1        y     5
5  2        y     8
6  3        y     1

Again, the first argument is the name of the dataset. The second is the variable is the varying representing the observation IDs.

20.3 Example Usage Case

Sometimes with ggplot, we need to have the data in long format. This happens when we want to plot multiple variables on the same plot with different colors. Let’s use the petrol price dataset from Chapter 19 to demonstrate this:

# Read in and clean petrol price data:
df <- read.csv("avg_daily_petrol_prices.csv")
df$date <- as.Date(df$date)  # format dates
head(df)
        date       e5      e10   diesel
1 2014-06-08 1.551987 1.477774 1.353583
2 2014-06-09 1.576623 1.483362 1.385182
3 2014-06-10 1.569619 1.478455 1.374060
4 2014-06-11 1.572578 1.481119 1.377091
5 2014-06-12 1.574652 1.480383 1.378247
6 2014-06-13 1.584659 1.492049 1.387577

The dataset is currently in “wide” format. The date runs down the dataset and the variables (petrol prices) at each date are stored horizontally from this. Let’s go to “long” format with the melt() function, where "date" represents the observation IDs.

df2 <- melt(df, "date")
head(df2)
        date variable    value
1 2014-06-08       e5 1.551987
2 2014-06-09       e5 1.576623
3 2014-06-10       e5 1.569619
4 2014-06-11       e5 1.572578
5 2014-06-12       e5 1.574652
6 2014-06-13       e5 1.584659

Now the dataset is in long format: we have the date, a variable representing the variable names (variable), and the values of each variable (value).

Let’s use this long-format data to plot the petrol prices over time for each type of petrol:

library(ggplot2)
ggplot(df2, aes(date, value, color = variable)) +
  geom_line()

We can customize this plot a bit with:

levels(df2$variable) <- c("E5", "E10", "Diesel")
ggplot(df2, aes(date, value, color = variable)) +
  geom_line() +
  xlab("") +
  ylab("Average Daily Petrol Price (in Euro)") +
  scale_color_discrete(name = "Petrol Type:") +
  theme_minimal() +
  theme(legend.direction = "horizontal",
        legend.position = "bottom")