Tutorial Exercises Week 7

Question 1

Download the two datasets:

Read in both datasets. When reading in the house price dataset you should use the following command:

read.csv("cpb-house-prices.csv", sep = ";", dec = ",")

This is because the dataset uses semicolons to separate the columns instead of commas, and uses commas for decimals.
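
To see what the two options do, here is a minimal sketch on made-up data (the column names and values are invented for illustration, and the text is passed inline through the text argument instead of a file):

```r
# Toy semicolon-separated data with decimal commas (all values are made up).
raw <- "name;price_a;price_b
Town A;101,5;98,2
Town B;250,0;240,75"

# sep = ";" splits columns on semicolons; dec = "," parses 101,5 as 101.5.
toy <- read.csv(text = raw, sep = ";", dec = ",")

str(toy)  # price_a and price_b are read as numeric, not character
```

Without dec = ",", values such as 101,5 would be read as text rather than numbers.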

Rename the 3 variables to: "municipality", "house_price_2022", "house_price_2021".

The second dataset can be read in with the read.csv() function without any special options. Rename the 2 variables in that dataset to: "municipality", "pop_growth_2018_2023".
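
One common way to rename every column at once is to assign to names(). The sketch below uses a hypothetical data frame with placeholder column names V1 and V2; your own data will have different original names:

```r
# Hypothetical data frame standing in for a freshly read dataset.
toy <- data.frame(V1 = c("Town A", "Town B"), V2 = c(1.2, -0.4))

# Assigning to names() replaces all column names, in column order.
names(toy) <- c("municipality", "pop_growth_2018_2023")

names(toy)
```

The new names must be given in the same order as the columns appear in the data frame.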

Merge the two datasets together by the variable "municipality".
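
As a rough illustration of how merge() treats rows without a match, consider the sketch below. The town names and numbers are invented and are not taken from the real datasets:

```r
# Two made-up data frames that share the key column "municipality".
prices <- data.frame(municipality = c("Town A", "Town B", "Town C"),
                     house_price_2022 = c(300, 450, 520))
growth <- data.frame(municipality = c("Town A", "Town B", "Town D"),
                     pop_growth_2018_2023 = c(1.5, 3.2, 0.8))

# By default, merge() keeps only municipalities present in both data frames.
merged <- merge(prices, growth, by = "municipality")

# all = TRUE keeps every row; unmatched rows get NA in the other
# dataset's columns, which makes failed matches easy to spot.
merged_all <- merge(prices, growth, by = "municipality", all = TRUE)
merged_all
```

Here "Town C" and "Town D" appear only in merged_all, each with an NA in the column that came from the other data frame.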

One municipality from the population growth dataset fails to merge with the house price dataset. Which municipality is this?

Question 2

How many municipalities from the house price dataset fail to merge with the population growth dataset?
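
One possible way to count unmatched keys is setdiff(), which returns the elements of its first argument that are missing from its second. The vectors below are invented; in practice you would pass the municipality columns of your two data frames:

```r
# Made-up key vectors standing in for the two municipality columns.
prices_names <- c("Town A", "Town B", "Town C", "Town E")
growth_names <- c("Town A", "Town B", "Town D")

# Names in the first vector that do not appear in the second.
unmatched <- setdiff(prices_names, growth_names)

unmatched
length(unmatched)
```

Note that the result depends on the order of the arguments: swapping them lists the names that are missing in the other direction.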

Question 3

Using ggplot, create a scatter plot with population growth on the horizontal axis and the house price in 2022 on the vertical axis.

Add the following layer to your plot to get a fitted line through the points:

geom_smooth(method = "lm")
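
The sketch below shows how the layers fit together on a toy data frame. It assumes the ggplot2 package is installed, and the columns x and y are placeholders; you would substitute your merged dataset and its variable names:

```r
library(ggplot2)

# Toy data with placeholder columns x and y (values are made up).
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# aes() maps columns to axes, geom_point() draws the scatter plot,
# and geom_smooth(method = "lm") adds a fitted straight line.
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm")

print(p)
```

The shaded band drawn around the line by default is a confidence interval for the fitted values.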

Choose the answer below which best interprets what we can see in the plot.

  • Municipalities with higher population growth on average have higher house prices.
  • Municipalities with higher population growth on average have lower house prices.

Question 4

Reshape the original house price dataset from wide format to long format using the municipality as the ID variable. How many rows does the long format dataset have?
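
A minimal sketch of a wide-to-long reshape follows. It assumes the reshape2 package and its melt() function (the data.table package offers a similar melt()), and the data frame is invented for illustration:

```r
library(reshape2)

# Made-up wide data: one row per municipality, one column per year.
wide <- data.frame(municipality = c("Town A", "Town B", "Town C"),
                   price_y1 = c(300, 450, 520),
                   price_y2 = c(280, 430, 500))

# id.vars names the column(s) to keep as identifiers; every other
# column is stacked into a "variable" / "value" pair of columns.
long <- melt(wide, id.vars = "municipality")

long
```

Each municipality now appears once per measured column, with the former column name stored in variable.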

Question 5

If you correctly reshaped the dataset in the previous question, the first 4 rows should look like:

  municipality         variable  value
1  Bloemendaal house_price_2022 1118.9
2     Blaricum house_price_2022 1099.1
3  Laren (NH.) house_price_2022 1030.1
4    Wassenaar house_price_2022  970.8

Suppose the long format dataset is called df1_long. Which of the following commands will return the dataset back to its original format (apart from the order of the observations)?

  • dcast(df1_long, municipality ~ variable)
  • dcast(df1_long, variable ~ municipality)
  • dcast(df1_long, value ~ municipality)
  • dcast(df1_long, municipality ~ value)

Question 6

Download the dataset municipality-province.csv.

This dataset contains two variables: the municipality and the province in which each municipality is located.

Read in the dataset and rename the variables to "municipality", "province".

Merge the municipality-province.csv dataset with your previously-merged house price and population growth dataset.

Calculate the average of the variable house_price_2022 by province.
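
One way to compute a group average in base R is aggregate() with a formula. The sketch below uses invented provinces P1 to P3 and made-up prices:

```r
# Made-up data: several municipalities per (hypothetical) province.
toy <- data.frame(province = c("P1", "P1", "P2", "P2", "P3"),
                  house_price_2022 = c(300, 500, 450, 350, 600))

# The formula reads "house_price_2022 grouped by province";
# FUN = mean is applied within each group.
avg <- aggregate(house_price_2022 ~ province, data = toy, FUN = mean)

avg

# Row of the group with the largest average.
avg[which.max(avg$house_price_2022), ]
```

With the formula interface, aggregate() drops rows containing NA by default, which matters if your merged dataset has missing values.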

Which province has the highest average?