Titanic Data Visualization with Tableau-Akanksha Goel
Tableau links
Before feedback: https://public.tableau.com/profile/akanksha005#!/vizhome/TitanicDataVisualisation/Story1
After feedback: https://public.tableau.com/profile/akanksha005#!/vizhome/TitanicDatasetVisualisationafter_feedback/Story1?publish=yes
Summary
The largest passenger liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg at around 23:40 on Sunday, 14 April 1912. Her sinking on Monday, 15 April resulted in the deaths of more than 1,500 people, which made it one of the deadliest peacetime maritime disasters in history. This visualization shows how several factors affected the survival rate of passengers. First, we look at which cabin group was evacuated first and had more survivors. Next comes ticket class, the main factor of the study. We then add passengers' sex and port of embarkation to ticket class to see their combined effect on the survival rate. Finally, we look at how the number of dependents and the age group affect the probability of survival.
Design
The whole story uses two chart types:
1. Bar graphs: In every bar chart, the y-axis shows the count of passengers, and the label on top of each bar shows the percentage or number of passengers in that bar. Because the whole story focuses on the people who survived or died in the accident, bar charts make it easy to see the count of people in a particular category and to compare the categories of a variable.
2. Line plots: Line plots are used for quantitative continuous variables and help reveal the relationship between two variables.
Throughout the visualization, only three colors are used: blue, orange, and a dark red gradient. Reasoning: a consistent color encoding makes the plots easier to read.
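The numbers behind a bar chart of this kind (a count per bar plus a percentage label) can also be checked outside Tableau. Below is a minimal sketch with pandas, assuming the `Pclass` and `Survived` columns from the dataset. The function name `count_and_share` and the sample rows are hypothetical and only serve as an illustration; this is not the calculation used in the Tableau story.

```python
import pandas as pd


def count_and_share(df, category, outcome="Survived"):
    """Count passengers per (category, outcome) pair and compute each
    count's share of its category, in percent."""
    counts = df.groupby([category, outcome]).size().rename("count").reset_index()
    # Total passengers in each category, repeated on every row of that category
    totals = counts.groupby(category)["count"].transform("sum")
    counts["percent"] = (100 * counts["count"] / totals).round(1)
    return counts


# Made-up sample rows, not real Titanic records
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

table = count_and_share(sample, "Pclass")
print(table)
```

Each row of `table` corresponds to one bar: `count` is the bar height and `percent` is the label on top of it.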
Feedback
The initial version of the visualization was shared with two co-workers. Below is the feedback I received and the changes I made based on our discussions:
- In the cabin group vs. (dead vs. survivors) visualization, the relation between cabin group and the number of survivors was a bit unclear. I made the required changes in the latest visualization.
- In the port of embarkation visualization, the numbers of people who died and who survived could not be compared properly. I changed the axis so the two are compared vertically and added a label showing the count of people who died and who survived.
- In the age group visualization, the number of people in each age group was not taken into account, which may affect the overall observation. To account for this, I put the average number of survivors on the y-axis.
- In the second-to-last visualization, a reviewer did not understand how the survival percentage could be 100% when there is 1 sibling. In response, I replaced the labels with the actual probability of survival.
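The last two feedback points both come down to group size: the average of the 0/1 `Survived` flag is the survival rate of a group, and a very small group can show an extreme rate such as 100%. A minimal pandas sketch of that idea follows, assuming the `SibSp` and `Survived` columns. The function name `survival_rate_by` and the sample rows are hypothetical, and the actual Tableau calculation may differ.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Return the group size and the mean of the 0/1 Survived flag
    (the survival rate) for each value of `column`."""
    grouped = df.groupby(column)["Survived"]
    return pd.DataFrame({
        "passengers": grouped.size(),
        "survival_rate": grouped.mean(),
    })


# Made-up sample rows, not real Titanic records
sample = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 5],
    "Survived": [0, 0, 1, 0, 1, 0, 1],
})

rates = survival_rate_by(sample, "SibSp")
print(rates)
```

Showing `passengers` next to `survival_rate` makes it visible when a high rate rests on only one or two people.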
Resources
The Titanic dataset can be downloaded from Dataset Options
Description
Data Dictionary
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Cleaning the dataset
- Filling the missing values of age with the mean of the age.
# Cleaning the dataset
import pandas as pd

# Load the raw data
Titanic_data = pd.read_csv('titanic-data.csv')

# Fill missing ages with the mean age
Mean_age = Titanic_data['Age'].mean()
Titanic_data['Age'] = Titanic_data['Age'].fillna(Mean_age)

# Keep only the named columns (leaves out the extra 'Unnamed: 0' index column)
df = pd.DataFrame(Titanic_data, columns=['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
# df.to_csv('titanic-data.csv', sep=',', encoding='utf-8')
print(Titanic_data.head())
   Unnamed: 0  PassengerId  Survived  Pclass  \
0           0            1         0       3
1           1            2         1       1
2           2            3         1       3
3           3            4         1       1
4           4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare  Cabin  Embarked
0      0         A/5 21171   7.2500    NaN         S
1      0          PC 17599  71.2833    C85         C
2      0  STON/O2. 3101282   7.9250    NaN         S
3      0            113803  53.1000   C123         S
4      0            373450   8.0500    NaN         S
- Grouping the age in bins of 10 years.
- Dropping features that did not add value to the analysis, such as 'Ticket', 'Fare', 'Name', and 'PassengerId'.
- Grouping the cabins into ['A','B','C','D','E','F','other'] by the first letter of each passenger's cabin.
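The three remaining steps could be sketched in pandas as follows. This is an illustration under stated assumptions rather than the code actually used: the function name `prepare`, the new column names `AgeGroup` and `CabinGroup`, and the bin labels such as "20-29" are hypothetical, and missing cabins are assumed to fall into 'other'.

```python
import pandas as pd


def prepare(df):
    """Bin ages by decade, group cabins by first letter, drop unused columns."""
    out = df.copy()

    # Age bins of 10 years: [0, 10), [10, 20), ..., [90, 100)
    edges = list(range(0, 101, 10))
    labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]  # assumed label format
    out["AgeGroup"] = pd.cut(out["Age"], bins=edges, right=False, labels=labels)

    # Cabin group: first letter if it is A-F, otherwise 'other'
    # (assumption: missing cabins also count as 'other')
    deck = out["Cabin"].str[0]
    out["CabinGroup"] = deck.where(deck.isin(list("ABCDEF")), "other")

    # Drop the features that are not needed for the analysis
    return out.drop(columns=["Ticket", "Fare", "Name", "PassengerId"])


# Made-up sample rows, not real Titanic records
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, 4.0, 29.7],
    "Ticket": ["t1", "t2", "t3", "t4"],
    "Fare": [7.25, 71.28, 16.7, 8.05],
    "Cabin": [None, "C85", "G6", "E46"],
})

clean = prepare(sample)
print(clean)
```

With `right=False`, an age of exactly 30 falls into the "30-39" bin rather than "20-29".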