SpaceX Falcon 9 First Stage Landing Prediction

Assignment: Exploring and Preparing Data

Estimated time needed: 70 minutes

In this assignment, we will predict whether the Falcon 9 first stage will land successfully. SpaceX advertises Falcon 9 rocket launches on its website at a cost of 62 million dollars, while other providers cost upward of 165 million dollars each. Much of the savings comes from the fact that SpaceX can reuse the first stage.

In this lab, you will perform Exploratory Data Analysis and Feature Engineering.

Falcon 9 first stage will land successfully

Several examples of an unsuccessful landing are shown here:

Most unsuccessful landings are planned: SpaceX performs a controlled landing in the ocean.

Objectives

Perform Exploratory Data Analysis and Feature Engineering using Pandas and Matplotlib


Import Libraries and Define Auxiliary Functions

We will import the following libraries for the lab:

# pandas is a software library written for the Python programming language for data manipulation and analysis.
import pandas as pd
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
import numpy as np
# Matplotlib is a plotting library for Python, and pyplot gives us a MATLAB-like plotting framework. We will use it to label and show the plots.
import matplotlib.pyplot as plt
# Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
import seaborn as sns

Exploratory Data Analysis

First, let’s read the SpaceX dataset into a Pandas dataframe and display its first rows.

df = pd.read_csv("dataset_part_2.csv")

# If you were unable to complete the previous lab correctly you can uncomment and load this csv

# df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/api/dataset_part_2.csv')

df.head(5)
FlightNumber Date BoosterVersion PayloadMass Payload Orbit LaunchSite Outcome Flights GridFins Reused Legs LandingPad Block ReusedCount Serial Longitude Latitude Class
0 1 2010-06-04 Falcon 9 6123.547647 Dragon Qualification Unit LEO CCSFS SLC 40 None None 1 False False False NaN 1.0 0 B0003 -80.577366 28.561857 0
1 2 2012-05-22 Falcon 9 525.000000 COTS Demo Flight 2 LEO CCSFS SLC 40 None None 1 False False False NaN 1.0 0 B0005 -80.577366 28.561857 0
2 3 2013-03-01 Falcon 9 677.000000 CRS-2 ISS CCSFS SLC 40 None None 1 False False False NaN 1.0 0 B0007 -80.577366 28.561857 0
3 4 2013-09-29 Falcon 9 500.000000 CASSIOPE PO VAFB SLC 4E False Ocean 1 False False False NaN 1.0 0 B1003 -120.610829 34.632093 0
4 5 2013-12-03 Falcon 9 3170.000000 SES-8 GTO CCSFS SLC 40 None None 1 False False False NaN 1.0 0 B1004 -80.577366 28.561857 0
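
Beyond the first rows, it can help to check column types and missing values before plotting (the LandingPad column above already shows NaN entries). The following is a minimal sketch on a made-up three-row stand-in for df; the frame name sample and the pad label are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df; only a few of the real columns are reproduced.
sample = pd.DataFrame({
    "FlightNumber": [1, 2, 3],
    "PayloadMass": [6123.5, 525.0, 677.0],
    "LandingPad": [np.nan, np.nan, "pad-A"],  # "pad-A" is a hypothetical label
    "Class": [0, 0, 1],
})

# Count the missing values in each column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# The dtypes show which columns are numeric and which hold text.
print(sample.dtypes)
```

On the real dataframe the same two calls would be applied to df directly.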

First, let’s see how the FlightNumber (the sequential number of each launch attempt) and PayloadMass variables affect the launch outcome.

We can plot FlightNumber against PayloadMass and overlay the outcome of the launch. As the flight number increases, the first stage is more likely to land successfully. The payload mass also appears to matter: the more massive the payload, the less likely the first stage is to return.

sns.catplot(y="PayloadMass", x="FlightNumber", hue="Class", data=df, aspect = 3)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Payload Mass (kg)",fontsize=20)
plt.show()

[Figure: PayloadMass (kg) plotted against FlightNumber, colored by Class]
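
A scatter plot suggests the trend but does not quantify it. One possible check, sketched here on made-up data, is to split the flights into an early and a late range and compare the share of successful landings in each. The frame toy, its values, and the bin labels are assumptions and not taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for df: eight flights with an invented landing outcome.
toy = pd.DataFrame({
    "FlightNumber": [1, 2, 3, 4, 5, 6, 7, 8],
    "Class":        [0, 0, 1, 0, 1, 1, 0, 1],
})

# Split the flight numbers into two equal-width ranges.
toy["FlightBin"] = pd.cut(toy["FlightNumber"], bins=2, labels=["early", "late"])

# The mean of a 0/1 column is the share of successful landings.
rate_by_bin = toy.groupby("FlightBin", observed=True)["Class"].mean()
print(rate_by_bin)
```

With the real data, more bins may give a finer picture, at the price of fewer launches per bin.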

We see that different launch sites have different success rates. CCAFS LC-40 (labelled CCSFS SLC 40 in the dataset) has a success rate of 60%, while KSC LC-39A and VAFB SLC 4E have a success rate of 77%.
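
Per-site success rates like these can be computed with a groupby on the launch site. The sketch below uses a made-up frame with generic site names, so its numbers are not the dataset's; only the column names LaunchSite and Class follow the lab.

```python
import pandas as pd

# Made-up stand-in for df; site names and outcomes are invented.
toy = pd.DataFrame({
    "LaunchSite": ["site A", "site A", "site B", "site B", "site C"],
    "Class":      [0, 1, 1, 1, 0],
})

# launches counts the rows per site; success_rate is the mean of the 0/1 Class column.
site_summary = toy.groupby("LaunchSite").agg(
    launches=("Class", "count"),
    success_rate=("Class", "mean"),
)
print(site_summary)
```

Keeping the launch count next to the rate makes it easier to see how many launches each percentage rests on.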

Next, let’s drill down to each site and visualize its detailed launch records.

TASK 1: Visualize the relationship between Flight Number and Launch Site

Use the function catplot to plot FlightNumber vs. LaunchSite: set the parameter x to FlightNumber, set y to LaunchSite, and set hue to 'Class'.

# Plot a scatter point chart with x axis to be Flight Number and y axis to be the launch site, and hue to be the class value
sns.catplot(y="LaunchSite", x="FlightNumber", hue="Class", data=df, aspect = 5)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Launch Site",fontsize=20)
plt.show()

[Figure: LaunchSite plotted against FlightNumber, colored by Class]

On this scatter plot I can see that most launches used the CCAFS LC-40 launch site, but I don’t see any correlation between this site and the flight number. The launch sites KSC LC-39A and VAFB SLC 4E were used less often, but they have better success rates.
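
Reading counts off a scatter plot is error-prone. A cross-tabulation of launch site against outcome gives the same information as a table. This is a sketch on invented data with generic site names, not the lab's result.

```python
import pandas as pd

# Made-up stand-in for df; site names and outcomes are invented.
toy = pd.DataFrame({
    "LaunchSite": ["site A", "site A", "site A", "site B", "site B"],
    "Class":      [0, 0, 1, 1, 1],
})

# Rows are launch sites, columns are the Class values (0 = not landed, 1 = landed).
outcome_counts = pd.crosstab(toy["LaunchSite"], toy["Class"])
print(outcome_counts)
```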

TASK 2: Visualize the relationship between Payload and Launch Site

We also want to observe if there is any relationship between launch sites and their payload mass.

# Plot a scatter point chart with x axis to be Payload Mass (kg) and y axis to be the launch site, and hue to be the class value
sns.catplot(y="LaunchSite", x="PayloadMass", hue="Class", data=df, aspect = 3)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("Launch Site",fontsize=20)
plt.show()

[Figure: LaunchSite plotted against PayloadMass (kg), colored by Class]

In the Payload vs. Launch Site scatter plot, you will find that no rockets with a heavy payload mass (greater than 10000 kg) were launched from the VAFB SLC 4E launch site.
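
The same observation can be checked without a plot by taking the maximum payload per site and comparing it with the 10000 kg cut-off. The sketch uses invented payloads and generic site names; heavy_threshold is a name introduced here for illustration.

```python
import pandas as pd

# Made-up stand-in for df; payloads and site names are invented.
toy = pd.DataFrame({
    "LaunchSite": ["site A", "site A", "site B", "site B"],
    "PayloadMass": [3000.0, 15600.0, 500.0, 9600.0],
})

heavy_threshold = 10000  # kg, the cut-off named in the text

# Largest payload ever launched from each site.
max_payload = toy.groupby("LaunchSite")["PayloadMass"].max()

# Sites whose largest payload never exceeds the threshold.
sites_without_heavy = max_payload[max_payload <= heavy_threshold].index.tolist()
print(sites_without_heavy)
```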

TASK 3: Visualize the relationship between success rate and orbit type

Next, we want to visually check if there is any relationship between success rate and orbit type.

Let’s create a bar chart for the success rate of each orbit.

# HINT use groupby method on Orbit column and get the mean of Class column
group_df = df.groupby(['Orbit']).agg(mean_class=("Class", 'mean'))
group_df = group_df.reset_index()
group_df
Orbit mean_class
0 ES-L1 1.000000
1 GEO 1.000000
2 GTO 0.518519
3 HEO 1.000000
4 ISS 0.619048
5 LEO 0.714286
6 MEO 0.666667
7 PO 0.666667
8 SO 0.000000
9 SSO 1.000000
10 VLEO 0.857143
sns.barplot(x='Orbit', y="mean_class", data=group_df)
plt.xlabel("Orbit",fontsize=20)
plt.ylabel("Success Rate",fontsize=20)
plt.show()

[Figure: bar chart of success rate (mean of Class) for each Orbit]

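We can see that ES-L1, GEO, HEO and SSO have a success rate of 100%, followed by VLEO at about 86%. GTO (about 52%) and SO (0%) have the lowest rates. Altitude alone does not explain this ranking, since both low orbits (SSO, VLEO) and high orbits (GEO, HEO) appear near the top.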
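
To read the ranking off directly rather than from the bar heights, the mean_class values printed in the group_df output above can be sorted. The sketch copies those values into a Series; the variable names are introduced here for illustration.

```python
import pandas as pd

# mean_class values copied from the group_df output above.
orbit_rates = pd.Series({
    "ES-L1": 1.000000, "GEO": 1.000000, "GTO": 0.518519, "HEO": 1.000000,
    "ISS": 0.619048, "LEO": 0.714286, "MEO": 0.666667, "PO": 0.666667,
    "SO": 0.000000, "SSO": 1.000000, "VLEO": 0.857143,
})

# Highest success rate first.
ranked = orbit_rates.sort_values(ascending=False)
print(ranked)

# Orbits with a perfect record, and the orbit with the lowest rate.
perfect = sorted(ranked[ranked == 1.0].index)
lowest = ranked.idxmin()
print(perfect, lowest)
```

The table does not show how many launches each orbit has, so adding a count next to the mean would indicate how much weight each rate can carry.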

TASK 4: Visualize the relationship between FlightNumber and Orbit type

For each orbit, we want to see if there is any relationship between FlightNumber and Orbit type.

# Plot a scatter point chart with x axis to be FlightNumber and y axis to be the Orbit, and hue to be the class value
sns.catplot(y="Orbit", x="FlightNumber", hue="Class", data=df, aspect = 2)
plt.xlabel("Flight Number",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

[Figure: Orbit plotted against FlightNumber, colored by Class]

You should see that in the LEO orbit, success appears related to the flight number; on the other hand, there seems to be no relationship between flight number and success in the GTO orbit.
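
One rough way to put a number on "appears related" is the correlation between FlightNumber and Class within each orbit. The sketch below uses invented data constructed so that one orbit shows a strong relationship and the other none; it illustrates the calculation and says nothing about the real values.

```python
import pandas as pd

# Made-up stand-in for df; outcomes are invented to show both cases.
toy = pd.DataFrame({
    "Orbit":        ["LEO", "LEO", "LEO", "LEO", "GTO", "GTO", "GTO", "GTO"],
    "FlightNumber": [1, 2, 3, 4, 5, 6, 7, 8],
    "Class":        [0, 0, 1, 1, 1, 0, 0, 1],
})

# Pearson correlation between flight number and the 0/1 outcome, per orbit.
corr_by_orbit = {
    orbit: group["FlightNumber"].corr(group["Class"])
    for orbit, group in toy.groupby("Orbit")
}
print(corr_by_orbit)
```

With only a handful of launches per orbit, such a coefficient is a rough indicator at best.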

TASK 5: Visualize the relationship between Payload and Orbit type

Similarly, we can plot the Payload vs. Orbit scatter point chart to reveal the relationship between payload and orbit type.

# Plot a scatter point chart with x axis to be Payload and y axis to be the Orbit, and hue to be the class value
sns.catplot(y="Orbit", x="PayloadMass", hue="Class", data=df, aspect = 2)
plt.xlabel("Payload Mass (kg)",fontsize=20)
plt.ylabel("Orbit",fontsize=20)
plt.show()

[Figure: Orbit plotted against PayloadMass (kg), colored by Class]

With heavy payloads, the successful landing rate is higher for the PO (polar), LEO and ISS orbits.

However, for GTO we cannot distinguish this well, as both successful and unsuccessful landings occur there.
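
A pivot table of success rate by orbit and payload band is one way to tabulate what the scatter plot shows. In this sketch the data is invented, and the 8000 kg boundary between "light" and "heavy" is an arbitrary assumption, since the text does not define "heavy" for this plot.

```python
import pandas as pd

# Made-up stand-in for df; payloads and outcomes are invented.
toy = pd.DataFrame({
    "Orbit":       ["ISS", "ISS", "ISS", "GTO", "GTO", "GTO"],
    "PayloadMass": [2000.0, 9000.0, 12000.0, 3000.0, 5000.0, 6000.0],
    "Class":       [0, 1, 1, 1, 0, 1],
})

# Hypothetical cut-off: payloads above 8000 kg count as heavy.
toy["PayloadBand"] = toy["PayloadMass"].gt(8000).map({True: "heavy", False: "light"})

# Mean of Class for every orbit / payload band combination.
rate_table = toy.pivot_table(
    index="Orbit", columns="PayloadBand", values="Class", aggfunc="mean"
)
print(rate_table)
```

Combinations with no launches come out as NaN, which is worth distinguishing from a rate of zero.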

TASK 6: Visualize the launch success yearly trend

You can plot a line chart with Year on the x axis and the average success rate on the y axis to get the average launch success trend.

The following function will help you get the year from the date:

# A function to extract the year from each date string
def Extract_year():
    year = []  # local list, so re-running the cell does not append duplicate entries
    for i in df["Date"]:
        year.append(i.split("-")[0])
    return year
df['Year'] = Extract_year()

year_df = df.groupby(['Year']).agg(mean_class=("Class", 'mean'))
year_df = year_df.reset_index()
year_df
Year mean_class
0 2010 0.000000
1 2012 0.000000
2 2013 0.000000
3 2014 0.333333
4 2015 0.333333
5 2016 0.625000
6 2017 0.833333
7 2018 0.611111
8 2019 0.900000
9 2020 0.842105
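
As an alternative to splitting the date string, the year can be read from parsed dates. This is a sketch of that approach, not the lab's method; the three dates are copied from the first rows of df.head above, and the frame name dates is introduced here.

```python
import pandas as pd

# Dates copied from the first three rows of the df.head output.
dates = pd.DataFrame({"Date": ["2010-06-04", "2012-05-22", "2013-03-01"]})

# Parse the strings as dates, then read the year component.
dates["Year"] = pd.to_datetime(dates["Date"]).dt.year
print(dates)
```

Unlike the string split, this produces integer years, which affects how the x axis is treated when plotting.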
# Plot a line chart with x axis to be the extracted year and y axis to be the success rate
sns.lineplot(x = "Year", y = "mean_class", data=year_df)
plt.xlabel("Year",fontsize=20)
plt.ylabel("Success Rate",fontsize=20)
plt.show()

[Figure: line chart of yearly success rate (mean of Class) by Year]

You can observe that the success rate rose overall from 2013 to 2020, with dips in 2018 and 2020.
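
One way to check the trend numerically is to take the year-over-year difference of the mean_class values printed in the year_df output above. The sketch copies those values into a Series; the variable names are introduced here.

```python
import pandas as pd

# mean_class values copied from the year_df output above.
yearly_rate = pd.Series(
    [0.000000, 0.000000, 0.000000, 0.333333, 0.333333,
     0.625000, 0.833333, 0.611111, 0.900000, 0.842105],
    index=[2010, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020],
)

# Change relative to the previous year in the table.
change = yearly_rate.diff()

# Years in which the success rate fell.
years_with_drop = change[change < 0].index.tolist()
print(years_with_drop)
```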

Feature Engineering

By now, you should have some preliminary insights into how each important variable affects the success rate. We will now select the features that will be used for success prediction in a later module.

features = df[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'LandingPad', 'Block', 'ReusedCount', 'Serial']]
features.head()
FlightNumber PayloadMass Orbit LaunchSite Flights GridFins Reused Legs LandingPad Block ReusedCount Serial
0 1 6123.547647 LEO CCSFS SLC 40 1 False False False NaN 1.0 0 B0003
1 2 525.000000 LEO CCSFS SLC 40 1 False False False NaN 1.0 0 B0005
2 3 677.000000 ISS CCSFS SLC 40 1 False False False NaN 1.0 0 B0007
3 4 500.000000 PO VAFB SLC 4E 1 False False False NaN 1.0 0 B1003
4 5 3170.000000 GTO CCSFS SLC 40 1 False False False NaN 1.0 0 B1004

TASK 7: Create dummy variables for categorical columns

Use the function get_dummies on the features dataframe to apply one-hot encoding to the columns Orbit, LaunchSite, LandingPad, and Serial. Assign the result to the variable features_one_hot and display it using the method head. The resulting dataframe must include all features, including the encoded ones.

# HINT: Use get_dummies() function on the categorical columns
features_one_hot = pd.get_dummies(data=features, columns=['Orbit', 'LaunchSite', 'LandingPad', 'Serial'])
features_one_hot.head()
FlightNumber PayloadMass Flights GridFins Reused Legs Block ReusedCount Orbit_ES-L1 Orbit_GEO ... Serial_B1048 Serial_B1049 Serial_B1050 Serial_B1051 Serial_B1054 Serial_B1056 Serial_B1058 Serial_B1059 Serial_B1060 Serial_B1062
0 1 6123.547647 1 False False False 1.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 525.000000 1 False False False 1.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 677.000000 1 False False False 1.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 500.000000 1 False False False 1.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 3170.000000 1 False False False 1.0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 80 columns
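
To see what get_dummies does on a small scale, the sketch below encodes a single column of a made-up three-row frame. Passing dtype=int is an assumption made here so that the result shows 0/1 integers like the output above; recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Made-up stand-in for the features dataframe.
toy = pd.DataFrame({
    "PayloadMass": [500.0, 3170.0, 677.0],
    "Orbit": ["PO", "GTO", "ISS"],
})

# Each distinct Orbit value becomes its own Orbit_<value> column.
encoded = pd.get_dummies(toy, columns=["Orbit"], dtype=int)
print(encoded)
```

The non-encoded column is kept as-is, and the original Orbit column is replaced by one indicator column per category.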

TASK 8: Cast all numeric columns to float64

Now that our features_one_hot dataframe contains only numbers, cast the entire dataframe to type float64.

# HINT: use astype function
features_one_hot = features_one_hot.astype(np.float64)
features_one_hot
FlightNumber PayloadMass Flights GridFins Reused Legs Block ReusedCount Orbit_ES-L1 Orbit_GEO ... Serial_B1048 Serial_B1049 Serial_B1050 Serial_B1051 Serial_B1054 Serial_B1056 Serial_B1058 Serial_B1059 Serial_B1060 Serial_B1062
0 1.0 6123.547647 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2.0 525.000000 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3.0 677.000000 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4.0 500.000000 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5.0 3170.000000 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
85 86.0 15600.000000 2.0 1.0 1.0 1.0 5.0 12.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
86 87.0 15600.000000 3.0 1.0 1.0 1.0 5.0 13.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
87 88.0 15600.000000 6.0 1.0 1.0 1.0 5.0 12.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
88 89.0 15600.000000 3.0 1.0 1.0 1.0 5.0 12.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
89 90.0 3681.000000 1.0 1.0 0.0 1.0 5.0 8.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

90 rows × 80 columns
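
Note that astype returns a new dataframe and leaves the original unchanged, so the result has to be assigned if the cast should carry over to later steps such as the export. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up stand-in with an integer and a boolean column.
toy = pd.DataFrame({"Flights": [1, 2], "GridFins": [False, True]})

# astype returns a converted copy; toy itself keeps its dtypes.
converted = toy.astype(np.float64)

print(toy.dtypes)
print(converted.dtypes)
```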

We can now export it to a CSV file for the next section, but to make the answers consistent, the next lab will provide data in a pre-selected date range.

features_one_hot.to_csv('dataset_part_3.csv', index=False)
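
A quick round trip can confirm that the export preserves columns and values. The sketch below writes a made-up frame to an in-memory buffer instead of a file, which is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# Made-up stand-in for features_one_hot.
toy = pd.DataFrame({"FlightNumber": [1.0, 2.0], "Orbit_LEO": [1.0, 0.0]})

# Write without the index column, as in the export above, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.equals(toy))
```

With index=False the row labels are not written, so the file read back has the same columns as the original.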

Authors

Joseph Santarcangelo has a PhD in Electrical Engineering; his research focused on using machine learning, signal processing, and computer vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.

Nayef Abou Tayoun is a Data Scientist at IBM and is pursuing a Master of Management in Artificial Intelligence degree at Queen’s University.

Change Log

Date (YYYY-MM-DD) Version Changed By Change Description
2021-10-12 1.1 Lakshmi Holla Modified markdown
2020-09-20 1.0 Joseph Modified Multiple Areas
2020-11-10 1.1 Nayef updating the input data

Copyright © 2020 IBM Corporation. All rights reserved.