Data collection

I would like to show you where I got data, how I collected, investigated and clean dataset for explotatory data analysis of the Amsterdam residents.

There is a Wijk- en buurtkaart 2019 dataset which everybody could download. It’s quite big, so I filtered it to have only rows about Amsterdam neighbourhoods. The result dataset was saved in ams_buurten_2019.csv file.

Let’s take a look at what we have there:

import pandas as pd 
import geopandas as gpd

data = pd.read_csv('ams_buurten_2019.csv', sep = ";")
data = data[data["water"] == "NEE"]
print(data.shape)
data.head()

(470, 206)

	fid	geom	buurtcode	buurtnaam	wijkcode	gemeentecode	gemeentenaam	indelingswijziging_wijken_en_buurten	water	meest_voorkomende_postcode	...	theater_gemiddelde_afstand_in_km	theater_gemiddeld_aantal_binnen_5_km	theater_gemiddeld_aantal_binnen_10_km	theater_gemiddeld_aantal_binnen_20_km	gemiddelde_afstand_tot_museum	gemiddeld_aantal_musea_binnen_5_km	gemiddeld_aantal_musea_binnen_10_km	gemiddeld_aantal_musea_binnen_20_km	jaarstatcode	jaar
2	4159	GP	BU03630000	Kop Zeedijk	WK036300	GM0363	Amsterdam	1	NEE	1012	...	0.6	37.2	49.3	63.0	0.2	42.7	46.0	65.0	2019BU03630000	2019.0
3	4160	GP	BU03630001	Oude Kerk e.o.	WK036300	GM0363	Amsterdam	1	NEE	1012	...	0.5	37.0	50.4	62.9	0.2	42.6	46.1	65.0	2019BU03630001	2019.0
4	4161	GP	BU03630002	Burgwallen Oost	WK036300	GM0363	Amsterdam	1	NEE	1012	...	0.3	37.0	51.0	62.4	0.2	43.0	46.1	65.0	2019BU03630002	2019.0
5	4162	GP	BU03630003	Nes e.o.	WK036300	GM0363	Amsterdam	1	NEE	1012	...	0.2	37.0	51.0	62.6	0.2	43.0	46.7	65.0	2019BU03630003	2019.0
6	4163	GP	BU03630004	BG-terrein e.o.	WK036300	GM0363	Amsterdam	1	NEE	1012	...	0.2	37.0	51.0	62.0	0.3	43.0	47.3	65.0	2019BU03630004	2019.0

5 rows × 206 columns

Now I want to add bigger neighbourhoods called wijk in dutch, so I would be able to add districts (called stadsdelen).

From the Wijk- en buurtkaart 2019 dataset mentioned above I also got a dataset wijken.csv, but I only need wijknaam and wijkcode from it. Let’s get it and then add wijknaam to the main dataset:

new_data = pd.read_csv('wijken.csv', sep = ";")
data_wijk = new_data[new_data['gemeentenaam'] == "Amsterdam"]
data_wijk = data_wijk[data_wijk['wijknaam'] != ' ']
data_wijk = data_wijk[['wijkcode', 'wijknaam']]
print(data_wijk.shape)
data_wijk.head()

(99, 2)

	wijkcode	wijknaam
905	WK036300	Burgwallen-Oude Zijde
906	WK036301	Burgwallen-Nieuwe Zijde
907	WK036302	Grachtengordel-West
908	WK036303	Grachtengordel-Zuid
909	WK036304	Nieuwmarkt/Lastage

#data = pd.merge(data, data_wijk, on="wijkcode")
def get_wijk(row):
    data_row = data_wijk[data_wijk['wijkcode'] == row['wijkcode']]
    value = "" if data_row.empty else data_row['wijknaam'].iloc[0]
    return value
    

data = data.copy()

data['wijknaam'] = data.apply(lambda row: get_wijk(row), axis=1)
data['wijkcode'] = data['wijkcode'].astype(str).str.replace('WK0363', '')
print(data.shape)
data.head()

(470, 207)

	fid	geom	buurtcode	buurtnaam	gemeentecode	gemeentenaam	indelingswijziging_wijken_en_buurten	water	meest_voorkomende_postcode	...	theater_gemiddeld_aantal_binnen_5_km	theater_gemiddeld_aantal_binnen_10_km	theater_gemiddeld_aantal_binnen_20_km	gemiddelde_afstand_tot_museum	gemiddeld_aantal_musea_binnen_5_km	gemiddeld_aantal_musea_binnen_10_km	gemiddeld_aantal_musea_binnen_20_km	jaarstatcode	jaar	wijknaam
2	4159	GP	BU03630000	Kop Zeedijk	GM0363	Amsterdam	1	NEE	1012	...	37.2	49.3	63.0	0.2	42.7	46.0	65.0	2019BU03630000	2019.0	Burgwallen-Oude Zijde
3	4160	GP	BU03630001	Oude Kerk e.o.	GM0363	Amsterdam	1	NEE	1012	...	37.0	50.4	62.9	0.2	42.6	46.1	65.0	2019BU03630001	2019.0	Burgwallen-Oude Zijde
4	4161	GP	BU03630002	Burgwallen Oost	GM0363	Amsterdam	1	NEE	1012	...	37.0	51.0	62.4	0.2	43.0	46.1	65.0	2019BU03630002	2019.0	Burgwallen-Oude Zijde
5	4162	GP	BU03630003	Nes e.o.	GM0363	Amsterdam	1	NEE	1012	...	37.0	51.0	62.6	0.2	43.0	46.7	65.0	2019BU03630003	2019.0	Burgwallen-Oude Zijde
6	4163	GP	BU03630004	BG-terrein e.o.	GM0363	Amsterdam	1	NEE	1012	...	37.0	51.0	62.0	0.3	43.0	47.3	65.0	2019BU03630004	2019.0	Burgwallen-Oude Zijde

5 rows × 207 columns

From this website (from the parts called ‘Buurten OUD’ and ‘Stadsdelen OUD’) I got datasets which describe how Amsterdam divided by buurten and stadsdelen.

Let’s open it, prepare and add info about city districts (stadsdelen) from it to the main dataset:

import re
data_wijk_stadsdeelcode = pd.read_csv('GEBIED_BUURTEN.csv', sep = ";")
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode[['Wijkcode', 'Stadsdeelcode']]
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode.rename(columns={'Wijkcode': 'wijkcode'})
data_wijk_stadsdeelcode = data_wijk_stadsdeelcode.drop_duplicates(subset=["wijkcode"], keep='first')

print(data_wijk_stadsdeelcode.shape)
data_wijk_stadsdeelcode.head()

(107, 2)

	wijkcode	Stadsdeelcode
0	A00	A
3	A01	A
11	A02	A
14	A03	A
21	A04	A

data_stadsdelen = pd.read_csv('GEBIED_STADSDELEN.csv', sep = ";")
data_stadsdelen = data_stadsdelen[['Stadsdeelcode', 'Stadsdeelnaam']]
data_stadsdelen = data_stadsdelen.rename(columns={'Stadsdeelnaam': 'stadsdeel'})
#print(data_stadsdelen.shape)
data_stadsdelen.head()

	Stadsdeelcode	stadsdeel
0	A	Centrum
1	B	Westpoort
2	E	West
3	F	Nieuw-West
4	K	Zuid

print(data_stadsdelen.shape)
data_stadsdelen = pd.merge(data_stadsdelen, data_wijk_stadsdeelcode, on="Stadsdeelcode")
print(data_stadsdelen.shape)
data_stadsdelen.head()

(9, 2)
(107, 3)

	Stadsdeelcode	stadsdeel	wijkcode
0	A	Centrum	A00
1	A	Centrum	A01
2	A	Centrum	A02
3	A	Centrum	A03
4	A	Centrum	A04

data_stadsdelen['wijkcode'] = data_stadsdelen['wijkcode'].astype(str).str.replace('\D', 'WK0363', regex=True)
data_stadsdelen.to_excel('amsterdam_stadsdelen.xlsx', index = None, header=True)
data_stadsdelen['wijkcode'] = data_stadsdelen['wijkcode'].astype(str).str.replace('WK0363', '', regex=True)
print(data_stadsdelen.shape)
data_stadsdelen.head()

(107, 3)

	Stadsdeelcode	stadsdeel	wijkcode
0	A	Centrum	00
1	A	Centrum	01
2	A	Centrum	02
3	A	Centrum	03
4	A	Centrum	04

#data = pd.merge(data, data_wijk, on="wijkcode")
def get_stadsdeel(row):
    #print("wijkcode: " +  row['wijkcode'] + " stadsdeel: " + )
    data_row = data_stadsdelen[data_stadsdelen['wijkcode'] == row['wijkcode']]
    value = "" if data_row.empty else data_row['stadsdeel'].iloc[0]
    return value
    

data = data.copy()

data['stadsdeel'] = data.apply(lambda row: get_stadsdeel(row), axis=1)
print(data.shape)
data.head()

(470, 208)

	fid	geom	buurtcode	buurtnaam	gemeentecode	gemeentenaam	indelingswijziging_wijken_en_buurten	water	meest_voorkomende_postcode	...	theater_gemiddeld_aantal_binnen_10_km	theater_gemiddeld_aantal_binnen_20_km	gemiddelde_afstand_tot_museum	gemiddeld_aantal_musea_binnen_5_km	gemiddeld_aantal_musea_binnen_10_km	gemiddeld_aantal_musea_binnen_20_km	jaarstatcode	jaar	wijknaam	stadsdeel
2	4159	GP	BU03630000	Kop Zeedijk	GM0363	Amsterdam	1	NEE	1012	...	49.3	63.0	0.2	42.7	46.0	65.0	2019BU03630000	2019.0	Burgwallen-Oude Zijde	Centrum
3	4160	GP	BU03630001	Oude Kerk e.o.	GM0363	Amsterdam	1	NEE	1012	...	50.4	62.9	0.2	42.6	46.1	65.0	2019BU03630001	2019.0	Burgwallen-Oude Zijde	Centrum
4	4161	GP	BU03630002	Burgwallen Oost	GM0363	Amsterdam	1	NEE	1012	...	51.0	62.4	0.2	43.0	46.1	65.0	2019BU03630002	2019.0	Burgwallen-Oude Zijde	Centrum
5	4162	GP	BU03630003	Nes e.o.	GM0363	Amsterdam	1	NEE	1012	...	51.0	62.6	0.2	43.0	46.7	65.0	2019BU03630003	2019.0	Burgwallen-Oude Zijde	Centrum
6	4163	GP	BU03630004	BG-terrein e.o.	GM0363	Amsterdam	1	NEE	1012	...	51.0	62.0	0.3	43.0	47.3	65.0	2019BU03630004	2019.0	Burgwallen-Oude Zijde	Centrum

5 rows × 208 columns

Data cleaning

Now we need to clean data. I went through the dataset and see that some of the columns contains elements like ‘X’ or ‘-‘, which prevent them from being numerical. Let’s fix it, I need this columns as a numbers. Also let’s replace ‘-99999999’ value with ‘0’:

data = data.copy()
data.replace(to_replace="X", value="0", inplace=True)
data.replace(to_replace="-", value="0", inplace=True)
data.replace(to_replace=-99999999, value=0, inplace=True)

for col_name in data.select_dtypes(include=[object]):
    data[col_name] = data[col_name].str.replace(',','.')
    
data.fillna(0, inplace=True)

Now I want to convert everything I need to numbers:

cols=[i for i in data.columns if i not in ['geom','buurtcode', 'buurtnaam', 'wijkcode', 'gemeentecode', 'gemeentenaam', 
                                           'water', 'jaarstatcode', 'jrstatcode', 'wijknaam', 'stadsdeel']]
for col in cols:
    data[col]=pd.to_numeric(data[col])

print(data.dtypes.to_list())

[dtype('int64'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('O'), dtype('int64'), dtype('O'), dtype('int64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('float64'), dtype('O'), dtype('float64'), dtype('O'), dtype('O')]

Data preparation

Let’s add some new columns which I will need later to draw charts:

def calculate_totaal_from_perc(persentage, aantal):
    return int(persentage * aantal / 100)

def calculate_totaal_from_av(avg, aantal):
    return avg * aantal

data['huishoudens_met_kinderen'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_huishoudens_met_kinderen'],
                                                                            row['aantal_huishoudens']), axis=1)
data['percentage_migratieachtergrond'] = data.apply(lambda row: row['percentage_westerse_migratieachtergrond'] + row['percentage_niet_westerse_migratieachtergrond'], axis=1)

data['migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_migratieachtergrond'],
                                                                            row['aantal_inwoners']), axis=1)

data['mensen_van_nederland'] = data.apply(lambda row: calculate_totaal_from_perc(100 - row['percentage_migratieachtergrond'],
                                                                            row['aantal_inwoners']), axis=1)
data['perc_mensen_van_nederland'] = data.apply(lambda row: 100 - row['percentage_migratieachtergrond'], axis=1)

data['westerse_migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_westerse_migratieachtergrond'],
                                                                            row['aantal_inwoners']), axis=1)
data['niet_westerse_migratieachtergrond'] = data.apply(lambda row: calculate_totaal_from_perc(row['percentage_niet_westerse_migratieachtergrond'],
                                                                            row['aantal_inwoners']), axis=1)

And drop all the columns I won’t need for EDA:

cols_to_drop = [x for x in data.columns if "_binnen_" in x]
data.drop(cols_to_drop, axis=1, inplace = True)

As I noticed, dataset almost doesn’t have any info about Westpoort district, so let’s remove it from the data we’re gonna use for explotatory data analysis. Also we don’t won’t rows where aantal_inwoners is 0

print(data.shape)
data_eda = data[data['stadsdeel'] != "Westpoort"]
data_eda = data_eda[data_eda['aantal_inwoners'] != 0]
data_eda = data_eda[data_eda['bevolkingsdichtheid_inwoners_per_km2'] != 0]
print(data_eda.shape)

(470, 155)
(436, 155)

Also for EDA we doesn’t need columns which have descriptions of the neighbourhoods, so I will remove this data from the dataset too:

columns_to_drop = ['buurtcode',
                   'buurtnaam',
                   'wijkcode',
                   'gemeentecode',
                   'gemeentenaam',
                   'indelingswijziging_wijken_en_buurten',
                   'water',
                   'meest_voorkomende_postcode',
                   'dekkingspercentage',
                   'omgevingsadressendichtheid',
                   'stedelijkheid_adressen_per_km2',
                   'oppervlakte_totaal_in_ha',
                   'oppervlakte_land_in_ha',
                   'oppervlakte_water_in_ha',
                   'jaarstatcode',
                   'jaar',
                   'wijknaam'
                  ]
data_eda = data_eda.drop(columns_to_drop, axis=1)
data_eda = data_eda.drop([ column for column in data_eda.columns if "_afstand_" in column ], axis=1)

data_eda.to_csv(r'wijkbuurtkaart_for_eda.csv', index = False)