Data collection
I would like to show you where I got data, how I collected, investigated and clean dataset for explotatory data analysis of the Amsterdam residents.
There is a Wijk- en buurtkaart 2019 dataset which everybody could download. It’s quite big, so I filtered it to have only rows about Amsterdam neighbourhoods. The result dataset was saved in ams_buurten_2019.csv
file.
Let’s take a look at what we have there:
import pandas as pd
import geopandas as gpd
data = pd.read_csv('ams_buurten_2019.csv', sep = ";")
data = data[data["water"] == "NEE"]
print(data.shape)
data.head()
(470, 206)
fid | geom | buurtcode | buurtnaam | wijkcode | gemeentecode | gemeentenaam | indelingswijziging_wijken_en_buurten | water | meest_voorkomende_postcode | ... | theater_gemiddelde_afstand_in_km | theater_gemiddeld_aantal_binnen_5_km | theater_gemiddeld_aantal_binnen_10_km | theater_gemiddeld_aantal_binnen_20_km | gemiddelde_afstand_tot_museum | gemiddeld_aantal_musea_binnen_5_km | gemiddeld_aantal_musea_binnen_10_km | gemiddeld_aantal_musea_binnen_20_km | jaarstatcode | jaar | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4159 | GP | BU03630000 | Kop Zeedijk | WK036300 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 0.6 | 37.2 | 49.3 | 63.0 | 0.2 | 42.7 | 46.0 | 65.0 | 2019BU03630000 | 2019.0 |
3 | 4160 | GP | BU03630001 | Oude Kerk e.o. | WK036300 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 0.5 | 37.0 | 50.4 | 62.9 | 0.2 | 42.6 | 46.1 | 65.0 | 2019BU03630001 | 2019.0 |
4 | 4161 | GP | BU03630002 | Burgwallen Oost | WK036300 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 0.3 | 37.0 | 51.0 | 62.4 | 0.2 | 43.0 | 46.1 | 65.0 | 2019BU03630002 | 2019.0 |
5 | 4162 | GP | BU03630003 | Nes e.o. | WK036300 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 0.2 | 37.0 | 51.0 | 62.6 | 0.2 | 43.0 | 46.7 | 65.0 | 2019BU03630003 | 2019.0 |
6 | 4163 | GP | BU03630004 | BG-terrein e.o. | WK036300 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 0.2 | 37.0 | 51.0 | 62.0 | 0.3 | 43.0 | 47.3 | 65.0 | 2019BU03630004 | 2019.0 |
5 rows × 206 columns
Now I want to add bigger neighbourhoods called wijk
in dutch, so I would be able to add districts (called stadsdelen
).
From the Wijk- en buurtkaart 2019 dataset mentioned above I also got a dataset wijken.csv
, but I only need wijknaam
and wijkcode
from it.
Let’s get it and then add wijknaam
to the main dataset:
new_data = pd.read_csv('wijken.csv', sep = ";")
data_wijk = new_data[new_data['gemeentenaam'] == "Amsterdam"]
data_wijk = data_wijk[data_wijk['wijknaam'] != ' ']
data_wijk = data_wijk[['wijkcode', 'wijknaam']]
print(data_wijk.shape)
data_wijk.head()
(99, 2)
wijkcode | wijknaam | |
---|---|---|
905 | WK036300 | Burgwallen-Oude Zijde |
906 | WK036301 | Burgwallen-Nieuwe Zijde |
907 | WK036302 | Grachtengordel-West |
908 | WK036303 | Grachtengordel-Zuid |
909 | WK036304 | Nieuwmarkt/Lastage |
#data = pd.merge(data, data_wijk, on="wijkcode")
def get_wijk(row):
data_row = data_wijk[data_wijk['wijkcode'] == row['wijkcode']]
value = "" if data_row.empty else data_row['wijknaam'].iloc[0]
return value
data = data.copy()
data['wijknaam'] = data.apply(lambda row: get_wijk(row), axis=1)
data['wijkcode'] = data['wijkcode'].astype(str).str.replace('WK0363', '')
print(data.shape)
data.head()
(470, 207)
fid | geom | buurtcode | buurtnaam | wijkcode | gemeentecode | gemeentenaam | indelingswijziging_wijken_en_buurten | water | meest_voorkomende_postcode | ... | theater_gemiddeld_aantal_binnen_5_km | theater_gemiddeld_aantal_binnen_10_km | theater_gemiddeld_aantal_binnen_20_km | gemiddelde_afstand_tot_museum | gemiddeld_aantal_musea_binnen_5_km | gemiddeld_aantal_musea_binnen_10_km | gemiddeld_aantal_musea_binnen_20_km | jaarstatcode | jaar | wijknaam | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4159 | GP | BU03630000 | Kop Zeedijk | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 37.2 | 49.3 | 63.0 | 0.2 | 42.7 | 46.0 | 65.0 | 2019BU03630000 | 2019.0 | Burgwallen-Oude Zijde |
3 | 4160 | GP | BU03630001 | Oude Kerk e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 37.0 | 50.4 | 62.9 | 0.2 | 42.6 | 46.1 | 65.0 | 2019BU03630001 | 2019.0 | Burgwallen-Oude Zijde |
4 | 4161 | GP | BU03630002 | Burgwallen Oost | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 37.0 | 51.0 | 62.4 | 0.2 | 43.0 | 46.1 | 65.0 | 2019BU03630002 | 2019.0 | Burgwallen-Oude Zijde |
5 | 4162 | GP | BU03630003 | Nes e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 37.0 | 51.0 | 62.6 | 0.2 | 43.0 | 46.7 | 65.0 | 2019BU03630003 | 2019.0 | Burgwallen-Oude Zijde |
6 | 4163 | GP | BU03630004 | BG-terrein e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 37.0 | 51.0 | 62.0 | 0.3 | 43.0 | 47.3 | 65.0 | 2019BU03630004 | 2019.0 | Burgwallen-Oude Zijde |
5 rows × 207 columns
From this website (from the parts called ‘Buurten OUD’ and ‘Stadsdelen OUD’) I got datasets which describe how Amsterdam divided by buurten
and stadsdelen
.
Let’s open it, prepare and add info about city districts (stadsdelen
) from it to the main dataset:
import re
data_wijk_stadsdeelcode = pd.read_csv('GEBIED_BUURTEN.csv', sep = ";")
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode[['Wijkcode', 'Stadsdeelcode']]
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode.rename(columns={'Wijkcode': 'wijkcode'})
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode.drop_duplicates(subset=["wijkcode"], keep='first')
print(data_wijk_stadsdeelcode.shape)
data_wijk_stadsdeelcode.head()
(107, 2)
wijkcode | Stadsdeelcode | |
---|---|---|
0 | A00 | A |
3 | A01 | A |
11 | A02 | A |
14 | A03 | A |
21 | A04 | A |
data_stadsdelen = pd.read_csv('GEBIED_STADSDELEN.csv', sep = ";")
data_stadsdelen = data_stadsdelen[['Stadsdeelcode', 'Stadsdeelnaam']]
data_stadsdelen = data_stadsdelen.rename(columns={'Stadsdeelnaam': 'stadsdeel'})
#print(data_stadsdelen.shape)
data_stadsdelen.head()
Stadsdeelcode | stadsdeel | |
---|---|---|
0 | A | Centrum |
1 | B | Westpoort |
2 | E | West |
3 | F | Nieuw-West |
4 | K | Zuid |
print(data_stadsdelen.shape)
data_stadsdelen = pd.merge(data_stadsdelen, data_wijk_stadsdeelcode, on="Stadsdeelcode")
print(data_stadsdelen.shape)
data_stadsdelen.head()
(9, 2)
(107, 3)
Stadsdeelcode | stadsdeel | wijkcode | |
---|---|---|---|
0 | A | Centrum | A00 |
1 | A | Centrum | A01 |
2 | A | Centrum | A02 |
3 | A | Centrum | A03 |
4 | A | Centrum | A04 |
data_stadsdelen['wijkcode'] = data_stadsdelen['wijkcode'].astype(str).str.replace('\D', 'WK0363', regex=True)
data_stadsdelen.to_excel('amsterdam_stadsdelen.xlsx', index = None, header=True)
data_stadsdelen['wijkcode'] = data_stadsdelen['wijkcode'].astype(str).str.replace('WK0363', '', regex=True)
print(data_stadsdelen.shape)
data_stadsdelen.head()
(107, 3)
Stadsdeelcode | stadsdeel | wijkcode | |
---|---|---|---|
0 | A | Centrum | 00 |
1 | A | Centrum | 01 |
2 | A | Centrum | 02 |
3 | A | Centrum | 03 |
4 | A | Centrum | 04 |
#data = pd.merge(data, data_wijk, on="wijkcode")
def get_stadsdeel(row):
#print("wijkcode: " + row['wijkcode'] + " stadsdeel: " + )
data_row = data_stadsdelen[data_stadsdelen['wijkcode'] == row['wijkcode']]
value = "" if data_row.empty else data_row['stadsdeel'].iloc[0]
return value
data = data.copy()
data['stadsdeel'] = data.apply(lambda row: get_stadsdeel(row), axis=1)
print(data.shape)
data.head()
(470, 208)
fid | geom | buurtcode | buurtnaam | wijkcode | gemeentecode | gemeentenaam | indelingswijziging_wijken_en_buurten | water | meest_voorkomende_postcode | ... | theater_gemiddeld_aantal_binnen_10_km | theater_gemiddeld_aantal_binnen_20_km | gemiddelde_afstand_tot_museum | gemiddeld_aantal_musea_binnen_5_km | gemiddeld_aantal_musea_binnen_10_km | gemiddeld_aantal_musea_binnen_20_km | jaarstatcode | jaar | wijknaam | stadsdeel | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 4159 | GP | BU03630000 | Kop Zeedijk | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 49.3 | 63.0 | 0.2 | 42.7 | 46.0 | 65.0 | 2019BU03630000 | 2019.0 | Burgwallen-Oude Zijde | Centrum |
3 | 4160 | GP | BU03630001 | Oude Kerk e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 50.4 | 62.9 | 0.2 | 42.6 | 46.1 | 65.0 | 2019BU03630001 | 2019.0 | Burgwallen-Oude Zijde | Centrum |
4 | 4161 | GP | BU03630002 | Burgwallen Oost | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 51.0 | 62.4 | 0.2 | 43.0 | 46.1 | 65.0 | 2019BU03630002 | 2019.0 | Burgwallen-Oude Zijde | Centrum |
5 | 4162 | GP | BU03630003 | Nes e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 51.0 | 62.6 | 0.2 | 43.0 | 46.7 | 65.0 | 2019BU03630003 | 2019.0 | Burgwallen-Oude Zijde | Centrum |
6 | 4163 | GP | BU03630004 | BG-terrein e.o. | 00 | GM0363 | Amsterdam | 1 | NEE | 1012 | ... | 51.0 | 62.0 | 0.3 | 43.0 | 47.3 | 65.0 | 2019BU03630004 | 2019.0 | Burgwallen-Oude Zijde | Centrum |
5 rows × 208 columns
Data cleaning
Now we need to clean data. I went through the dataset and see that some of the columns contains elements like ‘X’ or ‘-‘, which prevent them from being numerical. Let’s fix it, I need this columns as a numbers. Also let’s replace ‘-99999999’ value with ‘0’:
data = data.copy()
data.replace(to_replace="X", value="0", inplace=True)
data.replace(to_replace="-", value="0", inplace=True)
data.replace(to_replace=-99999999, value=0, inplace=True)
for col_name in data.select_dtypes(include=[object]):
data[col_name] = data[col_name].str.replace(',','.')
data.fillna(0, inplace=True)
Now I want to convert everything I need to numbers:
cols=[i for i in data.columns if i not in ['geom','buurtcode', 'buurtnaam', 'wijkcode', 'gemeentecode', 'gemeentenaam',
'water', 'jaarstatcode', 'jrstatcode', 'wijknaam', 'stadsdeel']]
for col in cols:
data[col]=pd.to_numeric(data[col])
print(data.dtypes.to_list())
[dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('int64'), dtype('O'), dtype('int64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('O'), dtype('float64'), dtype('O'), dtype('O')]
Data preparation
Let’s add some new columns which I will need later to draw charts:
def calculate_totaal_from_perc(persentage, aantal):
return int(persentage * aantal / 100)
def calculate_totaal_from_av(avg, aantal):
return avg * aantal
data['huishoudens_met_kinderen'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_huishoudens_met_kinderen'],
row['aantal_huishoudens']), axis=1)
data['percentage_migratieachtergrond'] = data.apply(lambda row: row['percentage_westerse_migratieachtergrond'] + row['percentage_niet_westerse_migratieachtergrond'], axis=1)
data['migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_migratieachtergrond'],
row['aantal_inwoners']), axis=1)
data['mensen_van_nederland'] = data.apply(lambda row: calculate_totaal_from_perc(100 - row['percentage_migratieachtergrond'],
row['aantal_inwoners']), axis=1)
data['perc_mensen_van_nederland'] = data.apply(lambda row: 100 - row['percentage_migratieachtergrond'], axis=1)
data['westerse_migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_westerse_migratieachtergrond'],
row['aantal_inwoners']), axis=1)
data['niet_westerse_migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_niet_westerse_migratieachtergrond'],
row['aantal_inwoners']), axis=1)
And drop all the columns I won’t need for EDA:
cols_to_drop = [x for x in data.columns if "_binnen_" in x]
data.drop(cols_to_drop, axis=1, inplace = True)
As I noticed, dataset almost doesn’t have any info about Westpoort
district, so let’s remove it from the data we’re gonna use for explotatory data analysis. Also we don’t won’t rows where aantal_inwoners
is 0
print(data.shape)
data_eda = data[data['stadsdeel'] != "Westpoort"]
data_eda = data_eda[data_eda['aantal_inwoners'] != 0]
data_eda = data_eda[data_eda['bevolkingsdichtheid_inwoners_per_km2'] != 0]
print(data_eda.shape)
(470, 155)
(436, 155)
Also for EDA we doesn’t need columns which have descriptions of the neighbourhoods, so I will remove this data from the dataset too:
columns_to_drop = ['buurtcode',
'buurtnaam',
'wijkcode',
'gemeentecode',
'gemeentenaam',
'indelingswijziging_wijken_en_buurten',
'water',
'meest_voorkomende_postcode',
'dekkingspercentage',
'omgevingsadressendichtheid',
'stedelijkheid_adressen_per_km2',
'oppervlakte_totaal_in_ha',
'oppervlakte_land_in_ha',
'oppervlakte_water_in_ha',
'jaarstatcode',
'jaar',
'wijknaam'
]
data_eda = data_eda.drop(columns_to_drop, axis=1)
data_eda = data_eda.drop([ column for column in data_eda.columns if "_afstand_" in column ], axis=1)
data_eda.to_csv(r'wijkbuurtkaart_for_eda.csv', index = False)