19  Merging

Often we want to be able to join datasets together to analyze the relationship between variables. For example, suppose we are interested in the relationship between the price of crude oil on commodities markets and the average price of petrol at the pumps over time. We download data on daily petrol prices and daily data on crude oil prices. But if one dataset has more observations (spans a longer time period) than the other, or one has missing observations (such as weekend and holiday values missing), it’s not so straightforward to match them up together.

Fortunately, the merge() function solves these problems. We will learn how to use that function here.

19.1 Data Cleaning

First we read in and clean the petrol price data. You can download the dataset here.

df1 <- read.csv("avg_daily_petrol_prices.csv")
# Format dates:
df1$date <- as.Date(df1$date, format = "%Y-%m-%d")
summary(df1)
      date                  e5             e10            diesel      
 Min.   :2014-06-08   Min.   :1.159   Min.   :1.130   Min.   :0.9558  
 1st Qu.:2016-07-03   1st Qu.:1.340   1st Qu.:1.318   1st Qu.:1.1322  
 Median :2018-07-29   Median :1.402   Median :1.379   Median :1.2353  
 Mean   :2018-07-29   Mean   :1.456   Mean   :1.423   Mean   :1.2811  
 3rd Qu.:2020-08-23   3rd Qu.:1.522   3rd Qu.:1.479   3rd Qu.:1.3217  
 Max.   :2022-09-18   Max.   :2.261   Max.   :2.203   Max.   :2.3343  
nrow(df1)
[1] 3025

We now do the same with the Brent crude oil prices. You can download the dataset here.

df2 <- read.csv("Europe_Brent_Spot_Price_FOB.csv", skip = 4)
# Format dates:
df2$Day <- as.Date(df2$Day, format = "%m/%d/%Y")
# Rename variables:
names(df2) <- c("date", "crude_oil")
# Sort ascending:
df2 <- df2[order(df2$date), ]
summary(df2)
      date              crude_oil     
 Min.   :1987-05-20   Min.   :  9.10  
 1st Qu.:1996-03-06   1st Qu.: 19.03  
 Median :2005-01-04   Median : 38.08  
 Mean   :2005-01-12   Mean   : 48.22  
 3rd Qu.:2013-11-24   3rd Qu.: 69.67  
 Max.   :2022-09-19   Max.   :143.95  
nrow(df2)
[1] 8970

We see that the petrol price data covers 2014-2022, and the crude oil price data covers 1987-2022. If we look closer at the dates, we notice that the crude oil price data doesn’t include the weekends, whereas the petrol price data does:

head(df1$date, n = 10)
 [1] "2014-06-08" "2014-06-09" "2014-06-10" "2014-06-11" "2014-06-12"
 [6] "2014-06-13" "2014-06-14" "2014-06-15" "2014-06-16" "2014-06-17"
head(df2$date, n = 10)
 [1] "1987-05-20" "1987-05-21" "1987-05-22" "1987-05-25" "1987-05-26"
 [6] "1987-05-27" "1987-05-28" "1987-05-29" "1987-06-01" "1987-06-02"

19.2 Merging

19.2.1 The merge() Command

Now that the two datasets are clean, we merge the two using the merge() function. The first two arguments of the merge() function are the two datasets we want to merge. The third argument, by, specifies the variable name (in quotations) that links the two datasets. In our case, the variable linking the two is the date variable.

df <- merge(df1, df2, by = "date")
summary(df)
      date                  e5             e10            diesel      
 Min.   :2014-06-09   Min.   :1.159   Min.   :1.130   Min.   :0.9558  
 1st Qu.:2016-07-04   1st Qu.:1.339   1st Qu.:1.318   1st Qu.:1.1302  
 Median :2018-07-26   Median :1.402   Median :1.379   Median :1.2345  
 Mean   :2018-07-27   Mean   :1.455   Mean   :1.422   Mean   :1.2797  
 3rd Qu.:2020-08-19   3rd Qu.:1.521   3rd Qu.:1.478   3rd Qu.:1.3216  
 Max.   :2022-09-16   Max.   :2.261   Max.   :2.203   Max.   :2.3343  
   crude_oil     
 Min.   :  9.12  
 1st Qu.: 48.54  
 Median : 61.18  
 Mean   : 63.47  
 3rd Qu.: 72.97  
 Max.   :133.18  
nrow(df)
[1] 2107

We notice that the dataset becomes much smaller: only 2,107 observations instead of 3,025 in df1 and 8,970 in df2. This is because df1 only contained dates from 2014 onwards, and df2 only contains data on weekdays. The merged datasets only includes weekdays between 2014-2022, and is thus much smaller.

19.2.2 Keeping Unmatched Observations

If we want to avoid dropping the observations where there is no match, we can use one of the following options:

  • all.x = TRUE : Keeps all observations in the 1st dataset, but only merges data from the 2nd dataset when there is a match. When there is no match, variables in the 2nd dataset get assigned NA values.
  • all.y = TRUE : Keeps all observations in the 2nd dataset, but only merges data from the 1st dataset when there is a match. When there is no match, variables in the 1st dataset get assigned NA values.
  • all = TRUE : This keeps all observations from both datasets, and variables get assigned NA values when there is no match. This is equivalent to setting both all.x = TRUE and all.y = TRUE.

For example, suppose we use the all.x = TRUE option:

df <- merge(df1, df2, by = "date", all.x = TRUE)
summary(df)
      date                  e5             e10            diesel      
 Min.   :2014-06-08   Min.   :1.159   Min.   :1.130   Min.   :0.9558  
 1st Qu.:2016-07-03   1st Qu.:1.340   1st Qu.:1.318   1st Qu.:1.1322  
 Median :2018-07-29   Median :1.402   Median :1.379   Median :1.2353  
 Mean   :2018-07-29   Mean   :1.456   Mean   :1.423   Mean   :1.2811  
 3rd Qu.:2020-08-23   3rd Qu.:1.522   3rd Qu.:1.479   3rd Qu.:1.3217  
 Max.   :2022-09-18   Max.   :2.261   Max.   :2.203   Max.   :2.3343  
                                                                      
   crude_oil     
 Min.   :  9.12  
 1st Qu.: 48.54  
 Median : 61.18  
 Mean   : 63.47  
 3rd Qu.: 72.97  
 Max.   :133.18  
 NA's   :918     
nrow(df)
[1] 3025

We see that we have 3,025 rows, the same as the original df1. However, the variable crude_oil, which was merged from df2 now has 918 missing values. Any time df2$crude_oil didn’t have a value for a date in df1, we replace it with an NA. When we didn’t specify all.x = TRUE, we ended up with 2,107 observations. This is because it dropped all the rows where there was no match. The 2,107 comes from the 3,025 original rows in df1, minus the 918 rows where there was no corresponding match in df2 (3,025-918=2,107).

19.2.3 Other Merging Options

Finally, we end with some further remarks on the merge() function. First, if you are merging on multiple variables, you can include a vector of variable names in the by argument. For example, suppose you are merging the two datasets:

  • df1: contains the revenue in each market area (variable market_area) and date (variable date).
  • df2: contains the advertising expenditure in each market area (variable market_area) and date (variable date).

Let’s create two example datasets for illustration purposes here:

df1 <- data.frame(expand.grid(
  market_area = c("Market A", "Market B"),
  date = seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "month")
))
df1$revenue <- runif(n = nrow(df1), min = 0, max = 1000)

df2 <- data.frame(expand.grid(
  market_area = c("Market A", "Market B"),
  date = seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by = "month")
))
df2$advertising_exp <- runif(n = nrow(df1), min = 0, max = 500)

df1
  market_area       date   revenue
1    Market A 2022-01-01 485.76845
2    Market B 2022-01-01 940.01892
3    Market A 2022-02-01  74.59912
4    Market B 2022-02-01 785.79297
5    Market A 2022-03-01 109.54504
6    Market B 2022-03-01 961.13105
df2
  market_area       date advertising_exp
1    Market A 2022-01-01        480.6382
2    Market B 2022-01-01        426.6870
3    Market A 2022-02-01        236.2161
4    Market B 2022-02-01        302.1044
5    Market A 2022-03-01        415.5219
6    Market B 2022-03-01        212.2895

Some notes on the functions used here:

  • The expand.grid() function here creates every combination of each market area and date (you won’t be asked to use this function in the exam).
  • The seq() function can also be used to create sequences of dates. We can specify the step length to be "day" "month", "quarter" or "year".
  • The runif() function creates n random numbers between min and max (you won’t be asked to use this function in the exam).

We can merge these two datasets by simply including a vector of the variable names in the by argument:

df <- merge(df1, df2, by = c("market_area", "date"))
df
  market_area       date   revenue advertising_exp
1    Market A 2022-01-01 485.76845        480.6382
2    Market A 2022-02-01  74.59912        236.2161
3    Market A 2022-03-01 109.54504        415.5219
4    Market B 2022-01-01 940.01892        426.6870
5    Market B 2022-02-01 785.79297        302.1044
6    Market B 2022-03-01 961.13105        212.2895

If the variable names are different in the two datasets, we could change the names of the variables to make them match before merging. But what we could do instead is use the by.x and by.y options in the merge() function. For instance, suppose in the previous example the market area variable was called "market" in df2 instead of "market_area". Let’s change the name of the variable in df2 to that:

names(df2)[names(df2) == "market_area"] <- "market"

If I want to merge the two datasets in this case I can do:

df <- merge(df1, df2, by.x = c("market_area", "date"),
            by.y = c("market", "date"))
df
  market_area       date   revenue advertising_exp
1    Market A 2022-01-01 485.76845        480.6382
2    Market A 2022-02-01  74.59912        236.2161
3    Market A 2022-03-01 109.54504        415.5219
4    Market B 2022-01-01 940.01892        426.6870
5    Market B 2022-02-01 785.79297        302.1044
6    Market B 2022-03-01 961.13105        212.2895

Finally, by default, merge() will sort the data by the merging variable(s). To avoid this behaviour you can use the sort = FALSE option.