Project 2 Medical Insurance Costs

Code: GitHub

2.1 Introduction

There is no doubt that health-related benefits are among the most important that a company can offer its employees. After all, they help significantly reduce medical expenses and guarantee the necessary support in times of difficulty. All of this reflects on the collective well-being and brings positive results.

Among the significant advantages of hiring health insurance in organizations, we can mention the greater engagement of the team. This is because employees who access quality medical care work with more attention, tranquility, and disposition, achieving their best performance during working hours.

Because Medical insurance is a much-desired benefit, its offer also helps attract and retain the best professionals in the market. Moreover, with great talents working in the company, it tends to become a benchmark for quality in what it does.

In this project, we analyze data from US Medical Insurance Costs. Recently, many people have been looking for this type of insurance, and prices vary widely according to the user’s data. The main question of this analysis is to answer the final price that users need to pay.

The dataset insurance.csv was provided by Kaggle.com: Medical Cost Personal Datasets

2.1.1 Project Goals

This project is making a case study using the linear regression method on the Medical Insurance Costs dataset provided by the Kaggle.

2.1.2 Analysis

This solution uses descriptive and inferential statistics and data visualization to find key figures in understanding the distribution, count, and relationship between variables.

2.1.3 Evaluation

The project concludes with the Evaluation of a linear model. A simple case study is done for one variable, and multiple variables are focused on visualizing the adjusted models then looking at the residual plots and the error band.

2.2 Data

The project has one dataset provided by Kaggle called insurance.csv. Each row represents a user in the data, and the columns are the responses to their user profiles, including data about them.

insurance has 1,338 rows and 7 columns, this is a good amount of data to analyze.

Each line, or sample, consists of the following features:

TABLE 2.1: Description of variables in the dataset.

Variable Description
age Age of primary beneficiary
sex Client’s gender
bmi Body mass index of the client. BMI is a measure of a person’s weight with respect to their height.
children Numbers of children depending on the client
smoker smoker yes or no
region Region of which the customer is a part
charges Annual charge of how much the customer must pay

2.3 Exploring the Data

Now let us look at the first five lines of the Data Frame that we want to analyze.

TABLE 2.2: The 5 first lines of the Data Frame Insurance.

age sex bmi children smoker region charges
19 female 27.9 0 yes southwest 16884.9
18 male 33.77 1 no southeast 1725.55
28 male 33 3 no southeast 4449.46
33 male 22.705 0 no northwest 21984.5
32 male 28.88 0 no northwest 3866.86

Before doing a deeper analysis of the data, we can see some of the data and how it is distributed from graphs.

We started by looking at the Age Histogram.

FIGURE 2.1: The distribution of the Age Histogram.

Then, we can look at the scatter graph in the age variable x charge and separate between smokers and non-smokers, as we believe it has a significant impact on the value of the final price charged.

FIGURE 2.2: Scatter plot of chargers vs age between smokers and non-smokers. No-Smoker = 0 and Smoker = 1

As can be seen in FIGURE 2.2, we have a leap in value when we analyze the graphs between smokers and non-smokers. Furthermore, we can see that there is a linear growth between age and the amount charged.

Next, let us look at the BMI distribution:

FIGURE 2.3: The BMI distribution.

The BMI values are all well distributed in Gaussian. We see that there are no outliers to be limited.

Knowing that BMI has a well-behaved distribution, we will look at how BMI is influenced by the processed value between treatment and untreated.

FFIGURE 2.4: Scatter plot of charges vs. BMI between smokers and non-smokers.

Analyzing FIGURE 2.1, we can see a linear growth between BMI and the amount charged when smokers, which is not the case for non-smokers.

Taking a first look at the data provided and some plotted graphs, we can see that there are probably relationships between some variables provided and the charge amount. Next, we will investigate it.

2.4 Linear Regression

We initially focus on regression models where the amount charged is the dependent variable. That is, we predict the amount charged based on other variables.

Since the feature charge is a quantitative variable, we model it using linear regression.

2.4.1 Interpreting regression parameters in a basic model

We start with a simple linear regression model with only one covariable, age, to predict the charge. The adjusted model expresses the amount charged as a linear function of age.

TABLE 2.3: Linear Regression Model adjusted for age variable using the python statsmodels library.

The part that starts to be relevant to the Table presented is from coef. This section contains the estimated values for the parameters of the regression model. Its pattern, errors, and other values are used to quantify the uncertainty in the estimates of the regression parameters.

This adjusted model implies that when comparing two people of different ages in one year, the oldest person will have an average of 257.72 units of charges larger than the youngest person. This difference is statistically significant, based on the p-value shown in the column labeled. ‘P> | t |’. This means that there is strong evidence that there is a natural association between age and charges made by the customer.

To better understand the meaning of the regression parameter 257.7226, we can look at the standard deviation of charges.

The standard deviation of around $ 12,110 is significant compared to the slope of the 258 regression. However, the regression slope corresponds to the average change in collection for a single year of age, and this effect accumulates with age. Comparing a 40 year old person with a 60 year old person, there is a 20-year difference, which translates into ’20 * 258 = 5,160 ’unit difference in the average value charged between these two people. This difference is about less than half a standard deviation and is generally considered to be a significant change.

2.4.2 R-square

In the case of regression with a single independent variable, as we have here, there is a very close correspondence between the regression analysis and Pearson’s correlation analysis. The R-square is a statistical measure of how close the data is to the adjusted regression line. The value obtained was 0.09, which means that the age of the customers explains 9% of the variation in the amount charged.

2.4.3 Adding more variables to the model

We use all the variables after analyzing a simple linear regression model with only one covariant (age). The real power of regression happens when we have more than one covariant to predict an output.

TABLE 2.4: Linear Regression Model adjusted for all variables using the python statsmodels library.

The model adjusted in TABLE 2.4 uses all parameters to explain the variation in the amount charged. We can see how attractive the value of R-square is. If we look at the table, we see a value of 0.75. That is, we have a linear regression model that 75% of the variation in the amount charged is probably explained by the variables provided.

Looking at the age coefficient, which we studied earlier, there were no changes, even adding many other variables. Thus, we can notice that the age variable has no relation with the other variables provided in the model. It is possible to confirm this result by looking at the value of the correlations between the variables and noting that there is almost no relationship between age and other variables.

FIGURE 2.5: Correlation between variables

2.5 Visualization of the fitted models

In this section, we demonstrate some graphing techniques that can be used to understand better a regression model that has been fit to data.

We start with plots that allow us to visualize the fitted regression function: the mean charges expressed as a function of the covariates. These plots help to show the estimated role of one variable when the other variables are held fixed. We will also plot 95% simultaneous confidence bands around these fitted lines. Although the estimated mean curve is never exact based on a finite sample of data, we can be 95% confident that the true mean curve falls somewhere within the shaded regions of the plots below.

This type of plot requires us to fix the values of all variables other than the independent variable (Charges here) and one independent variable that we call the focus variable (which is age here). Below we fix the gender as “female” and the BMI as 25. Thus, the graphs below show the relationship between expected Charges and age for women with BMI equal to 25.

The plot analyzed for the number of children is shown below. Age is 50, with the other parameters the same.

FIGURE 2.6: Fitted regression function charge vs age.

The analogous plot for Number of Children is shown next. Here we fix the gender as “female” and the age at 50, so we are looking at the relationship between common charge and age for women of 50.

FIGURE 2.7: Fitted regression function charge vs number of children.

The error band for the number of children is noticeably more significant than the error band for the age, indicating less certainty about children and charge than age and charge.

The discussion has primarily focused on the mean structure of the population, that is, the model for the average change of a person with a given age, gender, BMI, smoker, region, and children. A regression model can also be used to assess the variance structure of the population, that is, how much and in what manner the observations deviate from their mean. We will focus on informal, graphical methods for assessing this.

To begin with, we plot the residuals against the fitted values. Recall that the fitted values are the estimated means for each observation, and the residuals are the difference between an observation and its fitted mean. For example, the model may estimate that a 50-year-old female will have, on average, a charge of 32,000 $. However, a specific 50-year-old female may have a blood pressure of 110 or 150, for example. The fitted values for these women are 32,000, and their residuals are -10,000 and 10,000, respectively.

The most straightforward variance pattern that we can see in a linear regression occurs when the points are scattered around the mean, with the same degree of scattering throughout the range of the covariates. When there are multiple covariates, it is hard to assess whether the variance is uniform throughout this range. However, we can quickly check for a “mean/variance relationship,” in which there is a systematic relationship between the variance and the mean, i.e., the variance either increases or decreases systematically with the mean. The plot of residuals on fitted values is used to assess whether such a mean/variance relationship is present.

FIGURE 2.8: Fitted values vs Residuals.

Above we show the plot of residuals on fitted values for the Insurance data. It appears that we have a modestly increasing mean/variance relationship.

2.6 Conclusion

As can be seen, the linear regression model is a potent tool for analyzing model predictions if more than one variable is used. Using the python statsmodels library, we saw how easy and beneficial it is to analyze statistical models.