DATA VISUALIZATION
I. INTRODUCTION
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to non-technical audiences without confusion. In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions.
Data visualization is useful in almost every career. A wide range of data analysis software is currently available, including spreadsheet software such as Excel, procedure-based systems such as SAS, user interface-based systems such as SPSS, programming environments such as R, and various data mining systems. In this report, we use RStudio to visualize data.
This report uses a dataset on the health charges of US citizens together with several important features, such as age, sex, BMI, and number of children. From there, we can draw graphs, visualize the data, and run models to identify factors that contribute to increases in health expenditures. This information may provide insight into risk factors and potential starting points for preventive measures.
This report has four major sections. First, the theoretical background gives an overview of the ggplot2 package on which the report is based. The second section describes the data manipulation steps. The third section presents the data visualizations, with each stage described and explained. Finally, comments and ideas derived from the results are given at the end of the document.
II. THEORETICAL BACKGROUND
The Grammar of Graphics is a framework that lets us concisely describe the components of a graphic. It is implemented in R by the ggplot2 package.
ggplot2 is a popular choice for analysts who want a first look at their data. It is an R package for producing statistical, or data, graphics. Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics, that allows you to compose graphs by combining independent components. This makes ggplot2 powerful. Rather than being limited to sets of pre-defined graphics, you can create novel graphics that are tailored to your specific problem. In addition, ggplot2 produces attractive plots with little effort and takes care of fiddly details such as drawing legends.
ggplot2 has seven grammatical elements, of which the first three are essential. Schematically (this is not literal R syntax):
ggplot(data, aesthetics, geometries, facets, statistics, coordinates, theme)
- Data: the data that users want to visualize.
- Aesthetic mappings (aes): describe how properties of the data connect with features of the graph, such as distance along an axis, size, or color. They are needed in almost every plot.
- Geometric objects (geoms): represent what users actually see on the plot: points, lines, polygons, etc.
- Facets: partition a plot into a matrix of panels. Each panel shows a different subset of the data.
- Statistical transformations (stats): perform a statistical summary of the data before plotting. They are key to histograms and smoothers.
- Coordinate system: describes how data coordinates are mapped to the plane of the graphic.
- Theme: customizes the non-data components of plots, i.e. titles, labels, fonts, background, etc. It can be used to give plots a consistent, customized look.
The geometric object determines the main shape of the plot. Commonly used geoms include:
- geom_point (drawing a scatter plot).
- geom_smooth (adding a smoothed line to the plot to show the dominant pattern).
- geom_boxplot (producing a box-and-whisker plot to summarize the distribution).
- geom_histogram/geom_freqpoly (showing the distribution of continuous variables).
- geom_bar (showing the distribution of categorical variables).
- geom_path/geom_line (drawing lines between data points, for example values that change over time).
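The seven elements described above can be combined in a single plot. The following is a minimal sketch, assuming the ggplot2 package is installed. It uses R's built-in mtcars data rather than the insurance dataset, and the variable choices are purely illustrative.

```r
library(ggplot2)

# Built-in example data (NOT the insurance dataset used in this report)
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data + aesthetics
  geom_point() +                               # geometry: scatter plot
  geom_smooth(method = "lm", se = FALSE) +     # statistic: fitted linear trend per group
  facet_wrap(~ am) +                           # facets: one panel per transmission type
  coord_cartesian(ylim = c(10, 35)) +          # coordinates: visible y range
  theme_bw() +                                 # theme: non-data appearance
  labs(color = "Cylinders", title = "Fuel economy by weight")

print(p)
```

Only the first three elements (data, aesthetics, and a geom) are needed to draw a plot; the remaining layers refine it.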
III. DATA MANIPULATION
3.1. About the dataset
The dataset is taken from kaggle.com under a public licence. It includes 1338 observations (rows) and 7 variables (columns): six features that may affect health charges, and the charges themselves. The table below shows the labels, types, and descriptions of the 7 variables.
| Order | Label | Type | Description |
|---|---|---|---|
| 1 | age | Integer | Insurance contractor’s age |
| 2 | sex | Factor | Insurance contractor’s gender |
| 3 | bmi | Numeric | Body mass index |
| 4 | children | Integer | Number of children covered by health insurance |
| 5 | smoker | Factor | Smoking status (yes/no) |
| 6 | region | Factor | The beneficiary’s residential area in US |
| 7 | charges | Numeric | Individual medical costs billed by health insurance |
3.2. Descriptive Statistics
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
The summary of the variables gives a useful first look at the data. There are no missing values in the dataset, so dropping NAs or replacing them with the mean or median is not required.
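A check for missing values can be sketched as follows. The file name insurance.csv and the helper count_missing are assumptions made for illustration; the report does not state how the data were loaded. The toy data frame is made up solely to show what the helper returns.

```r
# Hypothetical helper: number of NA values in each column of a data frame
count_missing <- function(df) colSums(is.na(df))

# Hypothetical file name; adjust to wherever the Kaggle CSV is stored
path <- "insurance.csv"
if (file.exists(path)) {
  data <- read.csv(path, stringsAsFactors = TRUE)
  print(summary(data))        # descriptive statistics, as shown above
  print(count_missing(data))  # all zeros if nothing is missing
}

# Made-up toy data, only to illustrate the helper
toy <- data.frame(age = c(18L, NA, 64L),
                  smoker = factor(c("no", "yes", NA)))
toy_missing <- count_missing(toy)
print(toy_missing)
```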
| Numeric_variables | Min | Mean | Max | NAs |
|---|---|---|---|---|
| age | 18 | 39.21 | 64 | 0 |
| bmi | 15.96 | 30.66 | 53.13 | 0 |
| children | 0 | 1.095 | 5 | 0 |
| charges | 1122 | 13270 | 63770 | 0 |
Descriptive statistics of categorical variables.
| Category | Count | Frequency |
|---|---|---|
| sex | ||
| female | 662 | 49.48% |
| male | 676 | 50.52% |
| smoker | ||
| no | 1064 | 79.52% |
| yes | 274 | 20.48% |
| region | ||
| northeast | 324 | 24.22% |
| northwest | 325 | 24.29% |
| southeast | 364 | 27.20% |
| southwest | 325 | 24.29% |
According to the table above:
- The numbers of male and female insurance contractors are almost equal.
- The number of insurance policies is similar across geographical regions, with the southeast slightly higher (27.20%).
- About one-fifth of the insured (20.48%) smoke.
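The percentages in the table can be reproduced from the counts alone. A minimal sketch in base R, using only the counts reported above:

```r
# Counts taken from the categorical summary table
counts <- list(
  sex    = c(female = 662, male = 676),
  smoker = c(no = 1064, yes = 274),
  region = c(northeast = 324, northwest = 325, southeast = 364, southwest = 325)
)

# Share of each level in percent, rounded to two decimals
shares <- lapply(counts, function(x) round(100 * prop.table(x), 2))
print(shares)
```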
3.3. Handling data
An outlier is an observation that lies at a long distance from the other values in the sample. Outliers can be identified by looking at the boxplot below. There are several ways to handle them, such as removing them from the dataset or replacing them with the mean or median, that is, adjusting them to acceptable values.
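One common way to flag outliers is the 1.5 × IQR rule, which is close to (though not always identical to) the rule a boxplot uses for drawing its whiskers. The sketch below is an illustration rather than the procedure used in this report; flag_outliers is a hypothetical helper and the toy vector is made up.

```r
# Hypothetical helper: TRUE for values beyond k * IQR from the quartiles
flag_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

toy <- c(1, 2, 3, 4, 100)       # made-up values; 100 is an obvious outlier
is_out <- flag_outliers(toy)
print(is_out)

# Two of the handling options mentioned above
toy_removed  <- toy[!is_out]                    # drop the outliers
toy_replaced <- replace(toy, is_out, median(toy))  # replace them with the median
```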
IV. DATA VISUALIZATION
The next step is visualization, to get an overview of the data, become familiar with the correlations between variables, and detect outliers and patterns. This section is divided into two main parts: figures for a single variable and figures for multiple variables.
4.1. Single variable
4.1.1. Factor variable
First, a bar chart shows the number of insured by gender.
ggplot(data, aes(x = "", fill = sex)) +
  geom_bar(width = 0.5, stat = "count") +
  # print the count of each group on its bar segment (after_stat() needs ggplot2 >= 3.3.0)
  stat_count(geom = "text", aes(label = after_stat(count)), vjust = 4) +
  scale_fill_manual(values = c("orange", "seagreen")) +
  theme_bw() +
  labs(fill = "Female or Male", title = "Female/Male")
The bar chart illustrates the number of male and female policyholders. In detail, there are 676 male policyholders (50.52%) and 662 female policyholders (49.48%).
# share of each region in percent, plus the label position inside each slice
df1 <- data %>%
  group_by(region) %>%
  summarise(counts = n()) %>%
  arrange(desc(region)) %>%
  mutate(prop = round(counts * 100 / sum(counts), 1),
         lab.ypos = cumsum(prop) - 0.5 * prop)
ggplot(df1, aes(x = "", y = prop, fill = region)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  geom_text(aes(y = lab.ypos, label = prop), color = "white") +
  coord_polar("y", start = 0) +
  scale_fill_brewer(palette = "Dark2") +
  theme_void() +
  labs(title = "Pie chart of the beneficiary’s residential area", subtitle = "Unit: Percentage (%)")
The pie chart shows the proportion of customers in each of the four regions listed in the data. The number of people buying insurance is quite balanced across regions: the northwest and southwest each account for 24.3%, the northeast for 24.2%, and the southeast for 27.2%. Because the frequencies look similar, a hypothesis test should be used to check whether the differences between the groups are significant.
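One possible way to test whether the regional frequencies differ is a chi-squared goodness-of-fit test against equal proportions. The report does not say which test the authors had in mind, so the sketch below is only one reasonable choice; it uses the region counts from the descriptive statistics.

```r
# Region counts from the descriptive statistics table
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# Null hypothesis: all four regions are equally likely (the default for a vector)
res <- chisq.test(region_counts)
print(res)
```

With these counts the statistic is about 3.47 on 3 degrees of freedom, which is below the 5% critical value of 7.81, so this particular test would not reject equal proportions.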
For a factor variable, another way to show descriptive statistics is to use geom_linerange from the ggplot2 package. The chart below shows the percentage of smokers and non-smokers.
df3 <- data %>%
  group_by(smoker) %>%
  summarise(counts = n()) %>%
  arrange(desc(smoker)) %>%
  mutate(Percentage = round(counts * 100 / sum(counts), 1),
         lab.ypos = cumsum(Percentage) - 0.5 * Percentage)
ggplot(df3, aes(smoker, Percentage)) +
  # vertical line from 0 up to the percentage of each group
  geom_linerange(aes(x = smoker, ymin = 0, ymax = Percentage), color = "gray", size = 1.5) +
  geom_point(aes(color = smoker), size = 3) +
  scale_color_manual(values = c("palegreen3", "orangered3")) +
  theme_bw() +
  labs(x = NULL, y = NULL, title = "Smoker", subtitle = "Unit: Percentage (%)")
In general, most insured in the dataset do not smoke: non-smokers account for nearly 80% of the total, while smokers account for just over 20%. Smoking status may have a significant effect on insurance costs, but this needs to be checked with a hypothesis test later.
4.1.2. Numeric variable
ggplot(data, aes(x = charges)) +
  geom_density(alpha = 0.5) +
  ggtitle("Distribution of Charges")
The figure above illustrates the probability density of the charges variable. The data lie in the interval from about 1,000 to 65,000. The density is right-skewed (positive skewness): its peak lies to the left of the center, which shows that the majority of the data are concentrated on the left-hand side of the plot. That is, most individual medical costs billed by health insurance are low, approximately 1,000 to 15,000.
When testing the difference between two groups or building regression models, an important question is whether the numeric variable under consideration is normally distributed. The following two plots are used to assess this for the bmi variable.
# histogram on the density scale with a normal curve overlaid
qq1 <- ggplot(data, aes(bmi)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "#69b3a2") +
  stat_function(fun = dnorm,
                args = list(mean = mean(data$bmi, na.rm = TRUE),
                            sd = sd(data$bmi, na.rm = TRUE)),
                color = "black", size = 1) +
  theme_bw() +
  labs(title = "Histogram of BMI")
# QQ plot against the normal distribution
qq2 <- ggplot(data, aes(sample = bmi)) +
  stat_qq(color = "#69b3a2") +
  stat_qq_line() +
  theme_bw() +
  labs(title = "QQPlot of BMI")
grid.arrange(qq1, qq2, ncol = 2)
The left figure is a histogram of the bmi variable. Its density pattern closely matches the overlaid curve (the normal density with the sample mean and standard deviation), so the bmi variable appears to follow an approximately normal distribution. For greater certainty, the right figure shows a QQ plot. Here too, most of the observed data lie on the line expected under a normal distribution.
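A formal test can complement these visual checks. The sketch below uses the Shapiro-Wilk test from base R, which is one common option and not a method used in this report. Because the dataset is not bundled here, the values are simulated stand-ins rather than real bmi values.

```r
set.seed(1)

# Simulated stand-in for a roughly normal variable (NOT the real bmi column)
x <- rnorm(200, mean = 30, sd = 6)

# Null hypothesis: the sample comes from a normal distribution
res <- shapiro.test(x)
print(res)
```

On the real data, the equivalent call would be shapiro.test(data$bmi). With more than a thousand observations, such tests tend to reject normality even for small deviations, so the plots remain a useful guide.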
4.2. Multivariable
After analyzing each variable separately, this part considers several variables simultaneously. Multivariate analysis is where the fun as well as the complexity begins.
“The greatest value of a picture is when it forces us to notice what we never expected to see” (John Tukey).
With only a few pictures created with ggplot2 and different aesthetic attributes, multivariate analysis goes beyond checking distributions to exploring potential relationships, patterns, and correlations among the variables. Inferential statistics and hypothesis testing can also be used where the problem at hand calls for checking the statistical significance of differences between attributes, groups, and so on.
4.2.1. Two factor variables
First of all, the following bar chart compares the percentages of smokers and non-smokers for each gender.
ggplot(data, aes(x = factor(sex), fill = factor(smoker))) +
  geom_bar(position = "fill") +   # stack to 100% so the shares are comparable
  scale_fill_manual(values = c("palegreen3", "orangered3")) +
  theme_bw() +
  ylab(NULL) + xlab("Gender") +
  labs(fill = "Smoker", title = "Percentage of smokers and non-smokers by gender")
What stands out from the figure is that nearly 25% of male customers are smokers. Among female customers, the share of smokers is lower. This is consistent with the general observation that more men smoke than women.
4.2.2. Two numeric variables
When working with two numeric variables, linear regression is a common method for exploring the relationship between them. Before fitting a model to check for a statistical relationship, data visualization gives an overview of the variables. The scatter plot with a smoothing line from ggplot2 shows the relationship between the age and BMI of the insured.
ggplot(data, aes(bmi, age)) +
  geom_point(na.rm = TRUE, color = "springgreen3", shape = 9) +
  geom_smooth(method = "lm", color = "red2") +   # fitted linear regression line
  theme_bw() +
  ylab("AGE") + xlab("BMI") +
  labs(title = "Relationship between AGE and BMI of insured")
The age and BMI values are widely dispersed: the points are scattered across the whole plot. This shows that the two variables have a very low degree of correlation. Nevertheless, the fitted linear regression line slopes slightly upward from left to right, which indicates a weak positive association between age and BMI. The plot alone does not establish that age affects BMI.
4.2.3. One factor and one numeric variable
A multiple density chart is a density chart in which several groups are represented. It allows different distributions to be compared. For better visualization, the density chart below uses transparency to compare the distribution of insurance costs across regions.
ggplot(data, aes(x = charges, fill = region)) +
  geom_density(alpha = 0.7) +
  theme_bw() +
  scale_fill_brewer(palette = "Dark2") +
  labs(title = "Density of charges based on region")
Overall, the charges on the x-axis range from about 1,000 to approximately 65,000. All four regions have similar density curves, so it is difficult to distinguish the distribution of one region from another. Insurance costs are concentrated around approximately 5,000. The visualization shows clearly that insurance costs are similarly distributed across regions.
4.2.4. One factor and two numeric variables
When working with several variables, a scatter plot is well suited to one factor and two numeric variables. The figure below shows the relationship between age, charges, and smoking status.
p3 <- ggplot(data, aes(charges, age, col = smoker)) +
  geom_point(shape = 16) +
  xlab("charges") + ylab("age") + labs(color = "smoker: ") +
  scale_color_manual(values = c("palegreen3", "orangered3")) +
  theme_bw() +
  coord_flip() +   # swap the axes so that age is horizontal
  labs(title = "Plot of Insurance charges based on Age and smoking history")
ggplotly(p3)
Using a traditional scatter plot to compare two numeric variables (age and charges) with one factor variable (smoker) is a reasonable way to assess the cost of health insurance. It is clear that non-smokers incur lower charges than smokers. Non-smokers’ charges range from 0 to 40,000 only, while smokers’ charges range from 10,000 to over 60,000. This suggests that smoking has a strong influence on charges. In addition, insurance charges tend to rise with age.
p4 <- ggplot(data, aes(charges, bmi, col = sex)) +
  geom_point(shape = 1) +
  scale_color_manual(values = c("deeppink2", "cyan3")) +
  coord_flip() + xlab("charges") + ylab("bmi") +
  theme_bw() +
  labs(color = "sex: ", title = "Plot of Insurance charges based on Sex and BMI of insured")
ggplotly(p4)
The scatter plot is used again to explore the relationship between charges and bmi by sex. The figure above represents the relationship among three variables, namely charges, bmi, and sex. The points for male and female customers are not separated, which suggests that sex has almost no effect on the other two variables.
p <- plot_ly(data, x = ~age, y = ~bmi, z = ~charges)
p <- p %>% add_markers(color = ~region, colors = "Dark2")
p %>% layout(scene = list(xaxis = list(title = 'AGE'),
                          yaxis = list(title = 'BMI'),
                          zaxis = list(title = 'COSTS')),
             annotations = list(
               x = 1.13,
               y = 1.05,
               text = "regions",
               xref = 'paper',
               yref = 'paper',
               showarrow = FALSE
             ))
To compare more than three variables, a 3D scatter plot is a useful choice. The chart above shows the relationship between three numeric variables (age, bmi, and charges) and one categorical variable (region). Circles of different colors represent the respective regions. Hovering the mouse pointer over any point displays its x, y, and z coordinates, that is, age, bmi, and charges.
4.2.5. One numeric and two factor variables
The final figure in this report combines two boxplots that illustrate the relationship between charges, smoking status, and sex.
# left: one panel per sex, one box per smoking status
p5 <- ggplot(data, aes(x = smoker, y = charges, fill = sex)) +
  geom_boxplot() +
  labs(x = NULL, y = NULL, fill = NULL) +
  scale_fill_manual(values = c("deeppink2", "cyan3")) +
  facet_wrap(~sex, ncol = 2)
# right: both sexes side by side within each smoking status
p6 <- ggplot(data, aes(x = smoker, y = charges, fill = sex)) +
  scale_fill_manual(values = c("deeppink2", "cyan3")) +
  geom_boxplot(show.legend = FALSE)
grid.arrange(p5, p6, ncol = 2)
Box plots are an effective tool for comparing the distribution of a numeric variable across groups. In the left figure, the plot is split into two panels by sex, with one box per smoking status in each panel. From here, we can draw the same conclusion as from the scatter plots above: smokers of both sexes incur much higher charges than non-smokers. In the right figure, the boxes for the two sexes are drawn side by side within each smoking status. Although charges are broadly similar between the sexes, charges among smokers appear to be higher for men than for women. Conversely, among non-smokers, charges for men appear slightly lower. However, this difference is small.
ggcorr(data %>% mutate_if(is.factor, as.numeric), label = TRUE, label_size = 3, hjust = .85, size = 3)
The diagram shows the correlation between each pair of variables in the data, with factor variables converted to their integer codes. In general, the correlations are quite low, except for the correlation between “charges” and “smoker”. This suggests that smoking is strongly associated with insurance charges.
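The conversion as.numeric applied to a factor returns the integer code of each level, with levels ordered alphabetically by default. A small sketch with made-up values shows what this means for the correlation plot:

```r
# Made-up values, only to show how factors are coded
smoker <- factor(c("no", "yes", "no"))
region <- factor(c("southwest", "northeast", "southeast"))

smoker_codes <- as.numeric(smoker)  # no = 1, yes = 2
region_codes <- as.numeric(region)  # northeast = 1, southeast = 2, southwest = 3

print(smoker_codes)
print(region_codes)
```

For a two-level factor such as smoker or sex, the coding behaves like a 0/1 indicator, so the correlation is interpretable. For region, the codes follow alphabetical order only, so its correlations should be read with caution.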
V. CONCLUSION
This report uses a dataset of 1,338 observations and 7 variables describing the main characteristics of different customers, and explores the relationships between these variables. First, several types of figures are created to show the statistics of single variables, both factor and numeric. After that, the relationships among multiple variables are explored, covering (1) two categorical variables, (2) two continuous variables, (3) one categorical and one continuous variable, (4) two categorical and one continuous variable, and (5) two continuous and one categorical variable. Using the ggplot2 package and descriptive statistics in RStudio, we obtained an overview of the “health insurance cost” dataset, including trends, outliers, and correlations between variables. The next step is to build an assessment model and draw conclusions from it.
In conclusion, this project has met its objectives. We have identified the variables that appear to play a vital role in the amount of health insurance charges, which can inform a prediction model to help companies and customers. For further analysis, a dataset with more features would likely provide more accurate results.