Often we want to be able to join datasets together to analyze the relationship between variables. For example, suppose we are interested in the relationship between the price of crude oil on commodities markets and the average price of petrol at the pumps over time. We download data on daily petrol prices and daily data on crude oil prices. But if one dataset has more observations (spans a longer time period) than the other, or one has missing observations (such as weekend and holiday values missing), it’s not so straightforward to match them up together.
Fortunately, the merge() function solves these problems. We will learn how to use that function here.
19.1 Data Cleaning
First we read in and clean the petrol price data. You can download the dataset here.
df1 <-read.csv("avg_daily_petrol_prices.csv")# Format dates:df1$date <-as.Date(df1$date, format ="%Y-%m-%d")summary(df1)
date e5 e10 diesel
Min. :2014-06-08 Min. :1.159 Min. :1.130 Min. :0.9558
1st Qu.:2016-07-03 1st Qu.:1.340 1st Qu.:1.318 1st Qu.:1.1322
Median :2018-07-29 Median :1.402 Median :1.379 Median :1.2353
Mean :2018-07-29 Mean :1.456 Mean :1.423 Mean :1.2811
3rd Qu.:2020-08-23 3rd Qu.:1.522 3rd Qu.:1.479 3rd Qu.:1.3217
Max. :2022-09-18 Max. :2.261 Max. :2.203 Max. :2.3343
nrow(df1)
[1] 3025
We now do the same with the Brent crude oil prices. You can download the dataset here.
df2 <-read.csv("Europe_Brent_Spot_Price_FOB.csv", skip =4)# Format dates:df2$Day <-as.Date(df2$Day, format ="%m/%d/%Y")# Rename variables:names(df2) <-c("date", "crude_oil")# Sort ascending:df2 <- df2[order(df2$date), ]summary(df2)
date crude_oil
Min. :1987-05-20 Min. : 9.10
1st Qu.:1996-03-06 1st Qu.: 19.03
Median :2005-01-04 Median : 38.08
Mean :2005-01-12 Mean : 48.22
3rd Qu.:2013-11-24 3rd Qu.: 69.67
Max. :2022-09-19 Max. :143.95
nrow(df2)
[1] 8970
We see that the petrol price data covers 2014-2022, and the crude oil price data covers 1987-2022. If we look closer at the dates, we notice that the crude oil price data doesn’t include the weekends, whereas the petrol price data does:
Now that the two datasets are clean, we merge the two using the merge() function. The first two arguments of the merge() function are the two datasets we want to merge. The third argument, by, specifies the variable name (in quotations) that link the two datasets. In our case, the variable linking the two is the date variable.
df <-merge(df1, df2, by ="date")summary(df)
date e5 e10 diesel
Min. :2014-06-09 Min. :1.159 Min. :1.130 Min. :0.9558
1st Qu.:2016-07-04 1st Qu.:1.339 1st Qu.:1.318 1st Qu.:1.1302
Median :2018-07-26 Median :1.402 Median :1.379 Median :1.2345
Mean :2018-07-27 Mean :1.455 Mean :1.422 Mean :1.2797
3rd Qu.:2020-08-19 3rd Qu.:1.521 3rd Qu.:1.478 3rd Qu.:1.3216
Max. :2022-09-16 Max. :2.261 Max. :2.203 Max. :2.3343
crude_oil
Min. : 9.12
1st Qu.: 48.54
Median : 61.18
Mean : 63.47
3rd Qu.: 72.97
Max. :133.18
nrow(df)
[1] 2107
We notice that the dataset becomes much smaller: only 2,107 observations instead of 3,025 in df1 and 8,970 in df2. This is because df1 only contained dates from 2014 onwards, and df2 only contains data on weekdays. The merged datasets only includes weekdays between 2014-2022, and is thus much smaller.
19.2.2 Keeping Unmatched Observations
If we want to avoid dropping the observations where there is no match, we can use one of the following options:
all.x = TRUE : Keeps all observations in the 1st dataset, but only merges data from the 2nd dataset when there is a match. When there is no match, variables in the 2nd dataset get assigned NA values.
all.y = TRUE : Keeps all observations in the 2nd dataset, but only merges data from the 1st dataset when there is a match. When there is no match, variables in the 1st dataset get assigned NA values.
all = TRUE : This keeps all observations from both datasets, and variables get assigned NA values when there is no match. This is equivalent to setting both all.x = TRUE and all.y = TRUE.
For example, suppose we use the all.x = TRUE option:
df <-merge(df1, df2, by ="date", all.x =TRUE)summary(df)
date e5 e10 diesel
Min. :2014-06-08 Min. :1.159 Min. :1.130 Min. :0.9558
1st Qu.:2016-07-03 1st Qu.:1.340 1st Qu.:1.318 1st Qu.:1.1322
Median :2018-07-29 Median :1.402 Median :1.379 Median :1.2353
Mean :2018-07-29 Mean :1.456 Mean :1.423 Mean :1.2811
3rd Qu.:2020-08-23 3rd Qu.:1.522 3rd Qu.:1.479 3rd Qu.:1.3217
Max. :2022-09-18 Max. :2.261 Max. :2.203 Max. :2.3343
crude_oil
Min. : 9.12
1st Qu.: 48.54
Median : 61.18
Mean : 63.47
3rd Qu.: 72.97
Max. :133.18
NA's :918
nrow(df)
[1] 3025
We see that we have 3,025 rows, the same as the original df1. However, the variable crude_oil, which was merged from df2 now has 918 missing values. Any time df2$crude_oil didn’t have a value for a date in df1, we replace it with an NA. When we didn’t specify all.x = TRUE, we ended up with 2,107 observations. This is because it dropped all the rows where there was no match. The 2,107 comes from the 3,025 original rows in df1, minus the 918 rows where there was no corresponding match in df2 (3,025-918=2,107).
19.2.3 Other Merging Options
Finally, we end with some further remarks on the merge() function. First, if you are merging on multiple variables, you can include a vector of variable names in the by argument. For example, suppose you are merging the two datasets:
df1: contains the revenue in each market area (variable market_area) and date (variable date).
df2: contains the advertising expenditure in each market area (variable market_area) and date (variable date).
Let’s create two example datasets for illustration purposes here:
df1 <-data.frame(expand.grid(market_area =c("Market A", "Market B"),date =seq(as.Date("2022-01-01"), as.Date("2022-03-01"), by ="month")))df1$revenue <-runif(n =nrow(df1), min =0, max =1000)df2 <-data.frame(expand.grid(market_area =c("Market A", "Market B"),date =seq.Date(as.Date("2022-01-01"), as.Date("2022-03-01"), by ="month")))df2$advertising_exp <-runif(n =nrow(df1), min =0, max =500)df1
market_area date revenue
1 Market A 2022-01-01 485.76845
2 Market B 2022-01-01 940.01892
3 Market A 2022-02-01 74.59912
4 Market B 2022-02-01 785.79297
5 Market A 2022-03-01 109.54504
6 Market B 2022-03-01 961.13105
df2
market_area date advertising_exp
1 Market A 2022-01-01 480.6382
2 Market B 2022-01-01 426.6870
3 Market A 2022-02-01 236.2161
4 Market B 2022-02-01 302.1044
5 Market A 2022-03-01 415.5219
6 Market B 2022-03-01 212.2895
Some notes on the functions used here:
The expand.grid() function here creates every combination of each market area and date (you won’t be asked to use this function in the exam).
The seq() function can also be used to create sequences of dates. We can specify the step length to be "day""month", "quarter" or "year".
The runif() function creates n random numbers between min and max (you won’t be asked to use this function in the exam).
We can merge these two datasets by simply including a vector of the variable names in the by argument:
df <-merge(df1, df2, by =c("market_area", "date"))df
market_area date revenue advertising_exp
1 Market A 2022-01-01 485.76845 480.6382
2 Market A 2022-02-01 74.59912 236.2161
3 Market A 2022-03-01 109.54504 415.5219
4 Market B 2022-01-01 940.01892 426.6870
5 Market B 2022-02-01 785.79297 302.1044
6 Market B 2022-03-01 961.13105 212.2895
If the variable names are different in the two datasets, we could change the names of the variables to make them match before merging. But what we could do instead is use the by.x and by.y options in the merge() function. For instance, suppose in the previous example the market area variable was called "market" in df2 instead of "market_area". Let’s change the name of the variable in df2 to that:
names(df2)[names(df2) =="market_area"] <-"market"
If I want to merge the two datasets in this case I can do: