Homework week 2

Question 1: Use the summary function to gain an overview of the data set. Then display the mean and median for at least two attributes of your data.

#Load package
library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the data
Healthinsurancedata <- read_csv("WEEK2HOMEWORK/HealthInsurance.csv")

## New names:
## • `` -> `...1`

## Rows: 8802 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): health, limit, gender, insurance, married, selfemp, region, ethnici...
## dbl (3): ...1, age, family
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Use the summary function to get an overview of the data
summary(Healthinsurancedata)

##       ...1         health               age           limit          
##  Min.   :   1   Length:8802        Min.   :18.00   Length:8802       
##  1st Qu.:2201   Class :character   1st Qu.:30.00   Class :character  
##  Median :4402   Mode  :character   Median :39.00   Mode  :character  
##  Mean   :4402                      Mean   :38.94                     
##  3rd Qu.:6602                      3rd Qu.:48.00                     
##  Max.   :8802                      Max.   :62.00                     
##     gender           insurance           married            selfemp         
##  Length:8802        Length:8802        Length:8802        Length:8802       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      family          region           ethnicity          education        
##  Min.   : 1.000   Length:8802        Length:8802        Length:8802       
##  1st Qu.: 2.000   Class :character   Class :character   Class :character  
##  Median : 3.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 3.094                                                           
##  3rd Qu.: 4.000                                                           
##  Max.   :14.000

# Compute the mean and median of age and family
mean_age <- mean(Healthinsurancedata$age)
median_age <- median(Healthinsurancedata$age)

mean_family <- mean(Healthinsurancedata$family)
median_family <- median(Healthinsurancedata$family)

# Print the results
cat("Mean of Age is:", mean_age, "\n")

## Mean of Age is: 38.93683

cat("Median of Age is:", median_age, "\n")

## Median of Age is: 39

cat("Mean of family is:", mean_family, "\n")

## Mean of family is: 3.093501

cat("Median of family is:", median_family, "\n")

## Median of family is: 3
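
The same statistics can also be computed for several columns in one call. The following is only a minimal sketch: `toy` is a small made-up data frame standing in for the homework data, and `toy_stats` is a name chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for Healthinsurancedata
toy <- data.frame(
  age    = c(23, 31, 46, 52, 62),
  family = c(1, 2, 3, 5, 4)
)

# Mean and median of both columns in a single summarise() call;
# across() names the results age_mean, age_median, family_mean, family_median
toy_stats <- toy %>%
  summarise(across(c(age, family), list(mean = mean, median = median)))

toy_stats
```

This keeps the four numbers together in one row, which makes them easier to compare with a later subset.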

Question 2: Create a new data frame with a subset of the columns AND rows. There are several ways to do this, so feel free to try a couple if you want. Make sure to give the new data set a new name so it doesn’t simply overwrite the original.

# Keep five columns and the first 20 rows
new_df <- Healthinsurancedata %>%
  select(health, age, gender, married, family) %>%
  slice(1:20)

# Print result
new_df
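
Another way to take a subset is to pick rows by a condition instead of by position. This is a sketch under assumptions: `toy` is made-up data whose column names mirror the homework file, and the cutoff of 30 is arbitrary.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the homework file
toy <- data.frame(
  health = c("yes", "no", "yes", "yes"),
  age    = c(25, 40, 58, 33),
  gender = c("male", "female", "female", "male"),
  family = c(2, 4, 1, 3)
)

# Keep three columns and only the rows where age is at least 30
toy_subset <- toy %>%
  filter(age >= 30) %>%
  select(health, age, family)

toy_subset
```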

Question 3: Create new column names for each column in the new data frame created in step 2.

new_colnames <- c("Health_new","age_new","gender_new","married_new","family_new")
colnames(new_df) <- new_colnames

#Print result
new_df
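
When every column gets the same suffix, the names do not have to be typed by hand. A small sketch, assuming a made-up data frame `toy`, using dplyr's `rename_with()`:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(health = c("yes", "no"), age = c(25, 40))

# Append "_new" to every column name
toy_renamed <- toy %>%
  rename_with(~ paste0(.x, "_new"))

names(toy_renamed)
```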

Question 4: Use the summary function to create an overview of your new data frame created in step 2. Then print the mean and median for the same two attributes. Please compare (i.e. tell me how the values changed and why).

summary(new_df)

##   Health_new           age_new       gender_new        married_new       
##  Length:20          Min.   :23.00   Length:20          Length:20         
##  Class :character   1st Qu.:31.75   Class :character   Class :character  
##  Mode  :character   Median :46.00   Mode  :character   Mode  :character  
##                     Mean   :43.35                                        
##                     3rd Qu.:52.25                                        
##                     Max.   :62.00                                        
##    family_new 
##  Min.   :1.0  
##  1st Qu.:2.0  
##  Median :2.5  
##  Mean   :3.5  
##  3rd Qu.:5.0  
##  Max.   :8.0

#Displaying mean and median of new df
mean(new_df$age_new)

## [1] 43.35

median(new_df$age_new)

## [1] 46

mean(new_df$family_new)

## [1] 3.5

median(new_df$family_new)

## [1] 2.5

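Mean age rose from 38.94 to 43.35 and median age from 39 to 46. Mean family size rose from 3.09 to 3.5, while median family size fell from 3 to 2.5. The new data frame holds only the first 20 of the 8802 rows, so each statistic is computed from a small, non-random slice of the data, and a few unusually high or low values shift the results much more than they do in the full data set.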
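
The effect of taking the first rows instead of a random sample can be shown with an exaggerated, made-up case. In this sketch the ages in `toy` are deliberately sorted, which is an assumption for illustration and not a claim about the homework file; the seed is arbitrary.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical data: ages stored in ascending order,
# so the first rows are not typical of the whole column
toy <- data.frame(age = sort(sample(18:62, 200, replace = TRUE)))

first_20  <- toy %>% slice(1:20)           # first 20 rows by position
random_20 <- toy %>% slice_sample(n = 20)  # 20 rows drawn at random

mean(toy$age)        # full data
mean(first_20$age)   # youngest ages only, so below the full mean
mean(random_20$age)  # random rows, which tend to land nearer the full mean
```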

Question 5: For at least 3 different/distinct values in a column please rename so that every value in that column is renamed. For example, change the letter “e” to “excellent”, the letter “a” to “average” and the word “bad” to “terrible”.

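new_df <- new_df %>%
  mutate(family_new = case_when(
    # family_new is numeric, so compare against numbers
    family_new == 3 ~ "family of 3",
    family_new == 4 ~ "family of 4",
    family_new == 5 ~ "family of 5",
    # Catch-all so the remaining sizes are renamed too instead of becoming NA
    TRUE ~ paste("family of", family_new)
  ))
# Print the result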
new_df
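
One thing worth checking after a `case_when()` recode is whether any value matched none of the conditions, because those values come back as NA. A minimal sketch with made-up values (`family`, `no_fallback` and `with_fallback` are names used only for this illustration):

```r
library(dplyr)

family <- c(1, 3, 4, 5, 8)  # hypothetical family sizes

# Without a fallback, 1 and 8 match no condition and become NA
no_fallback <- case_when(
  family == 3 ~ "family of 3",
  family == 4 ~ "family of 4",
  family == 5 ~ "family of 5"
)

# With a catch-all branch, every value gets a label
with_fallback <- case_when(
  family == 3 ~ "family of 3",
  family == 4 ~ "family of 4",
  family == 5 ~ "family of 5",
  TRUE ~ "other family size"
)

sum(is.na(no_fallback))    # number of values left unlabeled
sum(is.na(with_fallback))  # should be zero
```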

Question 6: Display enough rows to see examples of all of steps 1-5 above. This means use a function to show me enough row values that I can see the changes.

head(new_df, 6)

Question 7: Bonus Question

Healthinsurancedata_github <- read_csv("https://raw.githubusercontent.com/Ismoo225/RBRIDGE/main/HealthInsurance.csv")

## New names:
## • `` -> `...1`

## Rows: 8802 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): health, limit, gender, insurance, married, selfemp, region, ethnici...
## dbl (3): ...1, age, family
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(Healthinsurancedata_github, 6)